ddangelov/Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.

28 Releases
Latest: 2y ago
hierarchical topic reduction improvements 1.0.34 (Latest)
ddangelov · 2y ago · November 16, 2023

📋 Changes

  • Fixed a loading bug
  • Fixed a hierarchical topic reduction bug
  • Added a parameter for optimizing hierarchical reduction speed

Topic indexing bugfix 1.0.33
ddangelov · 2y ago · November 3, 2023

1.0.32
ddangelov · 2y ago · November 2, 2023

Indexing bugfix

gpu hdbscan and topic indexing 1.0.31
ddangelov · 2y ago · November 2, 2023

1. Added GPU HDBSCAN.
2. Added topic indexing.

gpu umap 1.0.30
ddangelov · 2y ago · November 1, 2023

1. Changed default embedding model to `universal-sentence-encoder-multilingual`.
2. Added option for GPU UMAP with `gpu_umap` parameter.

Adding compute_topics 1.0.29
ddangelov · 3y ago · March 14, 2023

📋 Changes

  • Added a method for computing topics.
  • Exposed topic deduplication parameter `topic_merge_delta`.
  • Bug fixes.

Sklearn change in API fix 1.0.28
ddangelov · 3y ago · January 25, 2023

Switched from scikit-learn's `get_feature_names()` to `get_feature_names_out()`, following the scikit-learn API change.
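Code that has to run against both older and newer scikit-learn versions can guard the call. The following is a minimal sketch, not Top2Vec's actual code: the helper name `get_vocabulary_terms` and the two stand-in vectorizer classes are hypothetical and exist only so the example runs without scikit-learn installed.

```python
def get_vocabulary_terms(vectorizer):
    """Return feature names from a fitted vectorizer on old or new scikit-learn."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer scikit-learn releases expose get_feature_names_out().
        return list(vectorizer.get_feature_names_out())
    # Older releases only expose the since-removed get_feature_names().
    return list(vectorizer.get_feature_names())


# Hypothetical stand-ins that mimic the two API generations.
class NewStyleVectorizer:
    def get_feature_names_out(self):
        return ["topic", "vector"]


class OldStyleVectorizer:
    def get_feature_names(self):
        return ["topic", "vector"]


new_terms = get_vocabulary_terms(NewStyleVectorizer())
old_terms = get_vocabulary_terms(OldStyleVectorizer())
```

With a real fitted `CountVectorizer`, the same helper can be called in place of either method.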

Phrases and new embedding options 1.0.27
ddangelov · 4y ago · April 3, 2022

📋 Changes

  • New pre-trained transformer models available
  • Ability to use any embedding model by passing callable to `embedding_model`
  • New `embedding_batch_size` option
  • Document chunking options for long documents
  • Phrases in topics by setting `ngram_vocab=True`
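To illustrate what chunking a long document can look like, here is a minimal sliding-window sketch. It is an assumption about the general technique, not Top2Vec's implementation: the function `chunk_tokens` and its parameters `chunk_length` and `overlap_ratio` are hypothetical names chosen for this example.

```python
def chunk_tokens(tokens, chunk_length=100, overlap_ratio=0.5):
    """Split a token list into overlapping windows of at most chunk_length tokens."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    # Distance between the starts of consecutive chunks.
    step = max(1, int(chunk_length * (1 - overlap_ratio)))

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_length])
        # Stop once a chunk has reached the end of the document.
        if start + chunk_length >= len(tokens):
            break
    return chunks


chunks = chunk_tokens(list(range(10)), chunk_length=4, overlap_ratio=0.5)
```

Each chunk can then be embedded separately, which keeps long documents within the input limits of a pre-trained embedding model.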

Query documents and topics fix 1.0.26
ddangelov · 4y ago · July 9, 2021

Query documents and topics 1.0.25
ddangelov · 5y ago · June 23, 2021

Added `query_documents` and `query_topics` methods which allow for using a sequence of text such as a question, a sentence, a paragraph or a document to query documents or topics. Added `num_topics` parameter to `get_documents_topics` method which allows retrieving multiple topics per document.

gensim version fix 1.0.24
ddangelov · 5y ago · April 1, 2021

Fixes #152

1.0.23
ddangelov · 5y ago · February 12, 2021

Added `numpy>=1.20.0` dependency.

1.0.22
ddangelov · 5y ago · February 12, 2021

Numpy related bug fix and document id validation performance upgrade.

added umap/hdbscan custom args 1.0.21
ddangelov · 5y ago · February 5, 2021

Addressed #90, #125, and #126. Added an option to pass custom UMAP and HDBSCAN arguments. Fixed an issue with loading a model that uses a custom tokenizer.

added use_embedding_model_tokenizer option 1.0.20
ddangelov · 5y ago · January 9, 2021

Added `use_embedding_model_tokenizer` parameter. If set to `True` and if using an `embedding_model` other than `doc2vec`, use the model's tokenizer for document embedding. Fixed dependency issue with joblib. Fixed issues with wordclouds caused by negative similarity scores.
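Word cloud renderers generally expect positive weights, so signed cosine similarity scores need to be mapped to positive values first. The sketch below shows one possible mapping, a softmax; it is an assumption for illustration and not necessarily the fix applied in this release. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Map arbitrary real scores to positive weights that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def wordcloud_weights(words, scores):
    """Pair each word with a strictly positive weight, preserving score order."""
    return dict(zip(words, softmax(scores)))


weights = wordcloud_weights(["alpha", "beta", "gamma"], [0.5, 0.0, -0.3])
```

Because softmax is monotonic, words keep their relative ranking while every weight stays above zero.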

fix saving bug 1.0.19
ddangelov · 5y ago · December 10, 2020

Fixed bug #91

word indexing 1.0.18
ddangelov · 5y ago · December 10, 2020

Added an option for indexing word vectors, which speeds up search for models with large vocabularies, specifically in `search_words_by_vector` and `similar_words`. Added a new method, `search_words_by_vector`.

document indexing 1.0.17
ddangelov · 5y ago · December 7, 2020

Added an option for indexing document vectors, which speeds up search for models with a large number of documents, specifically in `search_documents_by_vector`, `search_documents_by_keywords`, and `search_documents_by_documents`. Added a new method, `search_documents_by_vector`. Added code to prevent the hierarchical topic reduction error #79.

Separate dependencies 1.0.16
ddangelov · 5y ago · November 10, 2020

Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional. They can be installed with `pip install top2vec[sentence-encoders]` and `pip install top2vec[sentence_transformers]`. Cosine similarity is now faster.

logging bug fix and default change 1.0.15
ddangelov · 5y ago · October 16, 2020

The `verbose` parameter is now set to `True` by default. Fixed a bug that stopped logging updates from being shown after pre-trained models were downloaded.

updated code documentation 1.0.13
ddangelov · 5y ago · October 15, 2020

added pre-trained universal sentence encoder and BERT sentence transformer options 1.0.12
ddangelov · 5y ago · October 15, 2020

Top2Vec now has an option to choose the embedding model with `doc2vec`, `universal-sentence-encoder`, `universal-sentence-encoder-multilingual`, and `distiluse-base-multilingual-cased` as the options. A `get_documents_topics` method was added.

added delete_documents methods and bug fixes 1.0.11
ddangelov · 5y ago · October 8, 2020

Added a method for deleting documents from model. Fixed bug when using `corpus_file` that resulted in documents getting dropped. Fixed bug when using `add_documents` and `delete_documents` which resulted in improper ordering of topic words.

UMAP install bug fix 1.0.10
ddangelov · 5y ago · August 29, 2020

UMAP failed to install because of a missing comma in the setup.py file; this has been fixed. An optional `min_count` parameter has been added; the default is still 50. All words with a total frequency lower than `min_count` are ignored by the model.
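The effect of a `min_count` threshold can be sketched in a few lines. This is an illustration of the rule described above, not Top2Vec's own code; the function name `filter_vocabulary` and the toy corpus are hypothetical.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_count=50):
    """Keep only words whose total frequency across all documents is at least min_count."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, count in counts.items() if count >= min_count}


docs = [
    ["topic", "model", "vector"],
    ["topic", "vector"],
    ["topic", "rare"],
]
# A small threshold is used here so the toy corpus keeps some words.
vocab = filter_vocabulary(docs, min_count=2)
```

Lowering `min_count` keeps rarer words, which matters for small corpora where few words reach a frequency of 50.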

Hierarchical Topic Reduction 1.0.9
ddangelov · 6y ago · June 26, 2020

Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.

Custom document ids, tokenizer input, option to save documents 1.0.8
ddangelov · 6y ago · April 18, 2020

Added an option for custom document ids, which can be strings or ints. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, with positive and negative semantic search.
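Positive and negative search over multiple documents can be pictured as combining their vectors into a single query vector. The sketch below adds normalized positive vectors, subtracts normalized negative ones, and ranks documents by cosine similarity. It is an assumption about the general idea, not Top2Vec's implementation; `combined_query`, `rank_documents`, and the toy vectors are hypothetical.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combined_query(positive, negative=()):
    """Build one unit-length query vector from positive and negative examples."""
    total = [0.0] * len(positive[0])
    for vec in positive:
        for i, x in enumerate(_normalize(vec)):
            total[i] += x
    for vec in negative:
        for i, x in enumerate(_normalize(vec)):
            total[i] -= x
    return _normalize(total)


def rank_documents(doc_vectors, query, top_n=5):
    """Return document ids ordered by cosine similarity to the query, best first."""
    scored = [
        (sum(a * b for a, b in zip(_normalize(vec), query)), doc_id)
        for doc_id, vec in doc_vectors.items()
    ]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_n]]


# Custom document ids (strings here) map to toy 2-d document vectors.
doc_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
query = combined_query(positive=[doc_vectors["a"]], negative=[doc_vectors["b"]])
ranking = rank_documents(doc_vectors, query)
```

Documents similar to the positive examples rise to the top, while those similar to the negative examples are pushed down.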

Topic size and deduplication 1.0.7
ddangelov · 6y ago · April 7, 2020

Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic now only shows documents that have the topic as their nearest topic, in order to avoid overlapping results from similar topics. Topic deduplication has been added to make topics more robust.
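The topic size definition above can be written down directly. The following is a minimal sketch that assumes cosine similarity as the nearness measure and non-zero vectors; the function names and toy vectors are hypothetical, and this is not Top2Vec's own code.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_topics(doc_vectors, topic_vectors):
    """For each document, return the index of its most similar topic vector."""
    return [
        max(range(len(topic_vectors)), key=lambda t: cosine(doc, topic_vectors[t]))
        for doc in doc_vectors
    ]


def topic_sizes(doc_vectors, topic_vectors):
    """Count how many documents have each topic as their nearest topic."""
    counts = Counter(nearest_topics(doc_vectors, topic_vectors))
    return [counts.get(t, 0) for t in range(len(topic_vectors))]


doc_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
assignments = nearest_topics(doc_vectors, topic_vectors)
sizes = topic_sizes(doc_vectors, topic_vectors)
```

Because each document is assigned to exactly one topic, the sizes sum to the number of documents, and a search restricted to a topic's own assignments cannot return the same document under two topics.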

First Release 1.0.6
ddangelov · 6y ago · March 25, 2020

Top2Vec initial release.