ddangelov/Top2Vec
Top2Vec learns jointly embedded topic, document and word vectors.
📋 Changes
- fixed loading bug
- fixed hierarchical topic reduction bug
- added parameter for optimizing hierarchical reduction speed
Indexing bugfix
1. Added GPU hdbscan. 2. Added topic indexing.
1. Changed default embedding model to `universal-sentence-encoder-multilingual`. 2. Added option for GPU umap with `gpu_umap` parameter.
📋 Changes
- Added a method for computing topics.
- Exposed topic deduplication parameter `topic_merge_delta`.
- Bug fixes.
Replaced `get_feature_names()` with `get_feature_names_out()`.
📋 Changes
- New pre-trained transformer models available
- Ability to use any embedding model by passing callable to `embedding_model`
- New `embedding_batch_size` option
- Document chunking options for long documents
- Phrases in topics by setting `ngram_vocab=True`
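
The document chunking option listed above can be pictured with a small sketch. This is not Top2Vec's implementation; the function name `chunk_tokens` and its parameters `chunk_length` and `overlap_ratio` are hypothetical, chosen only to show how a long document might be split into overlapping token windows before embedding.

```python
def chunk_tokens(tokens, chunk_length=100, overlap_ratio=0.5):
    """Split a token list into fixed-length windows that overlap by overlap_ratio.

    Hypothetical helper for illustration; not part of the Top2Vec API.
    """
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    # Distance between window starts; at least 1 so the loop always advances.
    step = max(1, int(chunk_length * (1 - overlap_ratio)))

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_length])
        if start + chunk_length >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


if __name__ == "__main__":
    words = "a long document is split into overlapping windows of tokens".split()
    for chunk in chunk_tokens(words, chunk_length=4, overlap_ratio=0.5):
        print(chunk)
```

Each chunk would then be embedded on its own, so that a model with a limited input length still sees the whole document.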
Added `query_documents` and `query_topics` methods which allow for using a sequence of text such as a question, a sentence, a paragraph or a document to query documents or topics. Added `num_topics` parameter to `get_documents_topics` method which allows retrieving multiple topics per document.
Fixes #152
Added `numpy>=1.20.0` dependency.
NumPy-related bug fix and document id validation performance improvement.
Addressed #90, #125, #126. Added an option for passing custom umap and hdbscan arguments. Fixed an issue with loading a model that uses a custom tokenizer.
Added `use_embedding_model_tokenizer` parameter. If set to `True` and if using an `embedding_model` other than `doc2vec`, use the model's tokenizer for document embedding. Fixed dependency issue with joblib. Fixed issues with wordclouds caused by negative similarity scores.
Fixed bug #91
Added an option for indexing word vectors, which speeds up search for models with large vocabularies, specifically `search_words_by_vector` and `similar_words`. Added new method `search_words_by_vector`.
Added an option for indexing document vectors, which speeds up search for models with a large number of documents, specifically `search_documents_by_vector`, `search_documents_by_keywords`, and `search_documents_by_documents`. Added new method `search_documents_by_vector`. Added code to prevent hierarchical topic reduction error #79.
Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional. They can be installed with `pip install top2vec[sentence-encoders]` and `pip install top2vec[sentence_transformers]`. Faster cosine similarity.
The `verbose` parameter will be set to True by default. Fixed a bug that stopped showing logging updates after downloading pre-trained models.
Top2Vec now has an option to choose the embedding model with `doc2vec`, `universal-sentence-encoder`, `universal-sentence-encoder-multilingual`, and `distiluse-base-multilingual-cased` as the options. A `get_documents_topics` method was added.
Added a method for deleting documents from the model. Fixed a bug when using `corpus_file` that resulted in documents getting dropped. Fixed a bug when using `add_documents` and `delete_documents` that resulted in improper ordering of topic words.
Fixed an issue with the UMAP install caused by a missing comma in the setup.py file. Added an optional `min_count` parameter; the default is still 50. All words with a total frequency lower than `min_count` are ignored by the model.
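
The effect of `min_count` can be illustrated with a short sketch. It is not the library's code; the helper name `filter_vocab` is hypothetical and only shows the rule stated above, that words whose total frequency across the corpus is below the threshold are dropped.

```python
from collections import Counter


def filter_vocab(tokenized_docs, min_count=50):
    """Drop every token whose total corpus frequency is below min_count.

    Hypothetical helper for illustration; not part of the Top2Vec API.
    """
    # Count each token over the whole corpus, not per document.
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    keep = {word for word, count in counts.items() if count >= min_count}
    return [[tok for tok in doc if tok in keep] for doc in tokenized_docs]


if __name__ == "__main__":
    docs = [["topic", "model", "topic"], ["topic", "rare", "model"]]
    # With min_count=2, "rare" (seen once) is removed from every document.
    print(filter_vocab(docs, min_count=2))
```

On small corpora a lower threshold than the default of 50 may be needed, since otherwise most of the vocabulary can fall below it.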
Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.
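
One plausible way to picture hierarchical topic reduction is sketched below. The note above does not describe the algorithm, so this is an assumption rather than the library's implementation: the smallest topic is repeatedly merged into its most similar topic (by cosine similarity), and the merged vector is a size-weighted average. The names `reduce_topics` and `cosine` are hypothetical, and topic sizes are assumed to be positive.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def reduce_topics(vectors, sizes, num_topics):
    """Merge the smallest topic into its nearest topic until num_topics remain.

    Illustrative sketch only; assumes every topic size is positive.
    """
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")
    vectors = [list(v) for v in vectors]  # work on copies
    sizes = list(sizes)

    while len(vectors) > num_topics:
        smallest = min(range(len(sizes)), key=sizes.__getitem__)
        others = [i for i in range(len(vectors)) if i != smallest]
        nearest = max(others, key=lambda i: cosine(vectors[smallest], vectors[i]))

        # Size-weighted average keeps large topics from being dragged around.
        total = sizes[smallest] + sizes[nearest]
        vectors[nearest] = [
            (a * sizes[smallest] + b * sizes[nearest]) / total
            for a, b in zip(vectors[smallest], vectors[nearest])
        ]
        sizes[nearest] = total
        del vectors[smallest]
        del sizes[smallest]
    return vectors, sizes


if __name__ == "__main__":
    topic_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    topic_sizes = [10, 2, 5]
    print(reduce_topics(topic_vectors, topic_sizes, num_topics=2))
```

Merging the smallest topic first means the largest, best-supported topics change the least as the hierarchy is built.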
Added an option for custom document ids, which can be strings or ints. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, with positive and negative semantic search.
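
Searching by multiple documents with positive and negative examples can be sketched as vector arithmetic. The code below is an illustration under assumptions, not the library's implementation: the query vector is taken to be the sum of the positive document vectors minus the sum of the negative ones, and the remaining documents are ranked by cosine similarity. The function name `search_by_documents` is hypothetical; ids may be strings or ints.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def search_by_documents(doc_vectors, positive_ids, negative_ids=(), top_n=5):
    """Rank documents by similarity to (sum of positives - sum of negatives).

    doc_vectors maps a document id to its vector. Illustrative sketch only.
    """
    dim = len(next(iter(doc_vectors.values())))
    query = [0.0] * dim
    for doc_id in positive_ids:
        query = [q + x for q, x in zip(query, doc_vectors[doc_id])]
    for doc_id in negative_ids:
        query = [q - x for q, x in zip(query, doc_vectors[doc_id])]

    # The query documents themselves are left out of the results.
    exclude = set(positive_ids) | set(negative_ids)
    scored = [
        (doc_id, cosine(query, vec))
        for doc_id, vec in doc_vectors.items()
        if doc_id not in exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    vectors = {
        "cats": [1.0, 0.0],
        "kittens": [0.95, 0.05],
        "dogs": [0.9, 0.2],
        "stocks": [0.0, 1.0],
        "bonds": [0.1, 0.9],
    }
    for doc_id, score in search_by_documents(vectors, ["cats"], ["stocks"]):
        print(doc_id, round(score, 3))
```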
Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic has been modified to show only documents that have the topic as their nearest topic, in order to avoid overlapping results from similar topics. Topic deduplication was added to make topics more robust.
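
The topic size definition above can be made concrete with a short sketch. It assumes "nearest" means highest cosine similarity, which the note does not state explicitly, and the function name `topic_sizes` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def topic_sizes(doc_vectors, topic_vectors):
    """Count, for each topic, the documents whose nearest topic vector it is.

    Illustrative sketch only; "nearest" is assumed to mean highest cosine similarity.
    """
    sizes = [0] * len(topic_vectors)
    for doc in doc_vectors:
        nearest = max(
            range(len(topic_vectors)),
            key=lambda t: cosine(doc, topic_vectors[t]),
        )
        sizes[nearest] += 1
    return sizes


if __name__ == "__main__":
    topics = [[1.0, 0.0], [0.0, 1.0]]
    docs = [[0.9, 0.1], [0.8, 0.3], [0.1, 1.0]]
    print(topic_sizes(docs, topics))
```

Because every document is counted for exactly one topic, the sizes always sum to the number of documents, which is also why restricting search by topic to these documents avoids overlapping results.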
Top2Vec initial release.
