
ddangelov/Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.

28 Releases


hierarchical topic reduction improvements 1.0.34 (Latest)

ddangelov·November 16, 2023

📋 Changes

Fixed a loading bug.
Fixed a hierarchical topic reduction bug.
Added a parameter for optimizing hierarchical reduction speed.

Topic indexing bugfix 1.0.33

ddangelov·November 3, 2023

1.0.32

ddangelov·November 2, 2023

Indexing bugfix

gpu hdbscan and topic indexing 1.0.31

ddangelov·November 2, 2023

1. Added GPU HDBSCAN.
2. Added topic indexing.

gpu umap 1.0.30

ddangelov·November 1, 2023

1. Changed default embedding model to `universal-sentence-encoder-multilingual`.
2. Added option for GPU UMAP with the `gpu_umap` parameter.

Adding compute_topics 1.0.29

ddangelov·March 14, 2023

📋 Changes

Added a method for computing topics.
Exposed topic deduplication parameter `topic_merge_delta`.
Bug fixes.

Sklearn change in API fix 1.0.28

ddangelov·January 25, 2023

Replaced calls to scikit-learn's `get_feature_names()` with `get_feature_names_out()`.
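Code that must run against both older and newer scikit-learn releases can guard the call. The helper below is a minimal sketch, not the change made in Top2Vec itself; `feature_names` and the two stand-in classes are hypothetical names used only so the snippet runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vocabulary of a fitted vectorizer as a plain list."""
    # Newer scikit-learn releases expose get_feature_names_out().
    if hasattr(vectorizer, "get_feature_names_out"):
        return list(vectorizer.get_feature_names_out())
    # Older releases only provide get_feature_names().
    return list(vectorizer.get_feature_names())


class _NewStyleVectorizer:
    """Stand-in for a vectorizer with the newer method (illustration only)."""

    def get_feature_names_out(self):
        return ["topic", "vector"]


class _OldStyleVectorizer:
    """Stand-in for a vectorizer with only the older method (illustration only)."""

    def get_feature_names(self):
        return ["topic", "vector"]


new_names = feature_names(_NewStyleVectorizer())
old_names = feature_names(_OldStyleVectorizer())
```

With a real fitted `CountVectorizer` or `TfidfVectorizer`, the same helper can be called in place of either method.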

Phrases and new embedding options 1.0.27

ddangelov·April 3, 2022

📋 Changes

New pre-trained transformer models available
Ability to use any embedding model by passing callable to `embedding_model`
New `embedding_batch_size` option
Document chunking options for long documents
Phrases in topics by setting `ngram_vocab=True`
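The release notes do not describe how long documents are chunked. As a rough illustration of the general idea, the sketch below splits a token list into fixed-length windows that overlap, so each chunk can be embedded separately. The function name and its parameters (`chunk_length`, `chunk_overlap_ratio`, `max_num_chunks`) are illustrative assumptions here, not a statement of the library's interface.

```python
def chunk_tokens(tokens, chunk_length=100, chunk_overlap_ratio=0.5, max_num_chunks=None):
    """Split a token list into sequential, overlapping chunks."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if not 0 <= chunk_overlap_ratio < 1:
        raise ValueError("chunk_overlap_ratio must be in [0, 1)")

    # Distance between the starts of consecutive chunks (at least one token).
    step = max(1, int(chunk_length * (1 - chunk_overlap_ratio)))

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_length])
        # Stop once the current chunk reaches the end of the document.
        if start + chunk_length >= len(tokens):
            break
        # Optionally cap the number of chunks per document.
        if max_num_chunks is not None and len(chunks) >= max_num_chunks:
            break
    return chunks


example_chunks = chunk_tokens(list(range(10)), chunk_length=4, chunk_overlap_ratio=0.5)
```

A larger overlap ratio produces more chunks per document and therefore more embedding calls; the cap keeps very long documents from dominating the corpus.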

Query documents and topics fix 1.0.26

ddangelov·July 9, 2021

Query documents and topics 1.0.25

ddangelov·June 23, 2021

Added `query_documents` and `query_topics` methods, which take a piece of text such as a question, a sentence, a paragraph, or a document and use it to query documents or topics. Added a `num_topics` parameter to the `get_documents_topics` method, which allows retrieving multiple topics per document.
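A short usage sketch of the methods named above. The method names and the `num_topics` parameter come from the release note; the positional argument order (query text first, then the number of results) and the helper names are assumptions. The return values are passed through untouched because the note does not describe their shape. `_RecordingModel` is a stand-in, not part of Top2Vec, so the snippet runs without a trained model.

```python
def run_text_queries(model, query, num_docs=5, num_topics=3):
    """Query a trained model with free text for documents and for topics."""
    doc_results = model.query_documents(query, num_docs)
    topic_results = model.query_topics(query, num_topics)
    return doc_results, topic_results


def topics_per_document(model, doc_ids, num_topics=2):
    """Retrieve several topics for each document id."""
    return model.get_documents_topics(doc_ids, num_topics=num_topics)


class _RecordingModel:
    """Stand-in that records calls instead of searching (illustration only)."""

    def __init__(self):
        self.calls = []

    def query_documents(self, query, num_docs):
        self.calls.append(("query_documents", query, num_docs))
        return ()

    def query_topics(self, query, num_topics):
        self.calls.append(("query_topics", query, num_topics))
        return ()

    def get_documents_topics(self, doc_ids, num_topics=1):
        self.calls.append(("get_documents_topics", tuple(doc_ids), num_topics))
        return ()


demo_model = _RecordingModel()
run_text_queries(demo_model, "What causes inflation?", num_docs=2, num_topics=1)
topics_per_document(demo_model, [0, 7], num_topics=2)
```

With a real trained model in place of the stand-in, the same helper functions would forward the query text to the library.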

gensim version fix 1.0.24

ddangelov·April 1, 2021

Fixes #152

1.0.23

ddangelov·February 12, 2021

Added `numpy>=1.20.0` dependency.

1.0.22

ddangelov·February 12, 2021

Numpy related bug fix and document id validation performance upgrade.

added umap/hdbscan custom args 1.0.21

ddangelov·February 5, 2021

Addressed #90, #125, and #126. Added an option for custom UMAP and HDBSCAN arguments. Fixed an issue with loading a model that uses a custom tokenizer.

added use_embedding_model_tokenizer option 1.0.20

ddangelov·January 9, 2021

Added `use_embedding_model_tokenizer` parameter. If set to `True` and if using an `embedding_model` other than `doc2vec`, use the model's tokenizer for document embedding. Fixed dependency issue with joblib. Fixed issues with wordclouds caused by negative similarity scores.

fix saving bug 1.0.19

ddangelov·December 10, 2020

Fixed bug #91

word indexing 1.0.18

ddangelov·December 10, 2020

Added an option for indexing word vectors, which speeds up search for models with large vocabularies, specifically in `search_words_by_vector` and `similar_words`. Added new method `search_words_by_vector`.

document indexing 1.0.17

ddangelov·December 7, 2020

Added an option for indexing document vectors, which speeds up search for models with a large number of documents, specifically in `search_documents_by_vector`, `search_documents_by_keywords`, and `search_documents_by_documents`. Added new method `search_documents_by_vector`. Added code to prevent the hierarchical topic reduction error #79.

Separate dependencies 1.0.16

ddangelov·November 10, 2020

Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional; install them with `pip install top2vec[sentence-encoders]` and `pip install top2vec[sentence_transformers]`, respectively. Faster cosine similarity.

logging bug fix and default change 1.0.15

ddangelov·October 16, 2020

The `verbose` parameter is now set to `True` by default. Fixed a bug that stopped logging updates from showing after pre-trained models were downloaded.

updated code documentation 1.0.13

ddangelov·October 15, 2020

added pre-trained universal sentence encoder and BERT sentence transformer options 1.0.12

ddangelov·October 15, 2020

Top2Vec now has an option to choose the embedding model with `doc2vec`, `universal-sentence-encoder`, `universal-sentence-encoder-multilingual`, and `distiluse-base-multilingual-cased` as the options. A `get_documents_topics` method was added.

added delete_documents methods and bug fixes 1.0.11

ddangelov·October 8, 2020

Added a method for deleting documents from the model. Fixed a bug when using `corpus_file` that resulted in documents getting dropped. Fixed a bug when using `add_documents` and `delete_documents` that resulted in improper ordering of topic words.

UMAP install bug fix 1.0.10

ddangelov·August 29, 2020

UMAP failed to install because of a missing comma in the setup.py file; this has been fixed. Added an optional `min_count` parameter; the default is still 50. All words with a total frequency lower than `min_count` are ignored by the model.
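The effect of `min_count` can be illustrated with a small standalone filter. This is a sketch of the rule as stated in the note (drop words whose total corpus frequency is below the threshold), not the library's internal code; `filter_rare_words` is a hypothetical helper name.

```python
from collections import Counter


def filter_rare_words(tokenized_docs, min_count=50):
    """Drop tokens whose total frequency across all documents is below min_count."""
    # Count every token over the whole corpus, not per document.
    counts = Counter(token for doc in tokenized_docs for token in doc)
    keep = {word for word, count in counts.items() if count >= min_count}
    return [[token for token in doc if token in keep] for doc in tokenized_docs]


docs = [["topic", "model", "topic"], ["topic", "vector"]]
filtered = filter_rare_words(docs, min_count=2)
```

On small corpora a threshold of 50 can remove most of the vocabulary, which is one reason to lower it.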

Hierarchical Topic Reduction 1.0.9

ddangelov·June 26, 2020

Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.
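The note does not spell out the reduction procedure. One common way to reduce topics hierarchically, shown here only as a simplified sketch under that assumption, is to repeatedly merge the smallest topic into its most similar topic until the desired number remains. The function names, the use of cosine similarity, the size-weighted mean for the merged vector, and the shortcut of adding sizes instead of reassigning documents are all assumptions of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def reduce_topics(topic_vectors, topic_sizes, num_topics):
    """Merge the smallest topic into its nearest topic until num_topics remain."""
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")
    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)  # assumed positive document counts

    while len(vectors) > num_topics:
        # Pick the topic with the fewest documents.
        smallest = min(range(len(sizes)), key=sizes.__getitem__)
        # Find the most similar remaining topic.
        nearest = max(
            (i for i in range(len(vectors)) if i != smallest),
            key=lambda i: cosine(vectors[smallest], vectors[i]),
        )
        total = sizes[smallest] + sizes[nearest]
        # Merged vector: mean of the two vectors weighted by topic size.
        vectors[nearest] = [
            (sizes[smallest] * a + sizes[nearest] * b) / total
            for a, b in zip(vectors[smallest], vectors[nearest])
        ]
        # Simplification: sizes are added rather than recomputed from documents.
        sizes[nearest] = total
        del vectors[smallest]
        del sizes[smallest]
    return vectors, sizes


reduced_vectors, reduced_sizes = reduce_topics(
    [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]], [10, 8, 2], num_topics=2
)
```

In the example, the two-document topic is absorbed by the first topic, whose vector shifts slightly toward it.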

Custom document ids, tokenizer input, option to save documents 1.0.8

ddangelov·April 18, 2020

Added an option for custom document ids, which can be strings or ints. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, with positive and negative semantic search.

Topic size and deduplication 1.0.7

ddangelov·April 7, 2020

Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic now shows only documents that have the topic as their nearest topic, which avoids overlapping results from similar topics. Topic deduplication was added to make topics more robust.
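The definition of topic size can be made concrete with a small standalone sketch. It assumes cosine similarity as the measure of "nearest", which the note does not state, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_topic(doc_vector, topic_vectors):
    """Index of the topic vector most similar to the document vector."""
    return max(range(len(topic_vectors)), key=lambda i: cosine(doc_vector, topic_vectors[i]))


def topic_sizes(doc_vectors, topic_vectors):
    """Number of documents whose nearest topic is each topic."""
    counts = Counter(nearest_topic(d, topic_vectors) for d in doc_vectors)
    return [counts.get(i, 0) for i in range(len(topic_vectors))]


def documents_for_topic(doc_vectors, topic_vectors, topic_index):
    """Indices of documents that have topic_index as their nearest topic."""
    return [i for i, d in enumerate(doc_vectors) if nearest_topic(d, topic_vectors) == topic_index]


doc_vecs = [[1.0, 0.0], [0.9, 0.2], [0.0, 1.0]]
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
sizes = topic_sizes(doc_vecs, topic_vecs)
first_topic_docs = documents_for_topic(doc_vecs, topic_vecs, 0)
```

Because each document is assigned to exactly one topic, the sizes sum to the number of documents and the per-topic document lists do not overlap.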

First Release 1.0.6

ddangelov·March 25, 2020

Top2Vec initial release.
