Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.

BSD-3-CLAUSE License

Downloads: 11.9K
Stars: 2.9K
Committers: 2


Top2Vec - updated code documentation

Published by ddangelov about 4 years ago

Top2Vec now has an option to choose the embedding model. The available options are doc2vec, universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased.

A get_documents_topics method was added.

Top2Vec - added delete_documents methods and bug fixes

Published by ddangelov about 4 years ago

Added a method for deleting documents from the model.

Fixed bug when using corpus_file that resulted in documents getting dropped. Fixed bug when using add_documents and delete_documents which resulted in improper ordering of topic words.

Top2Vec - UMAP install bug fix

Published by ddangelov about 4 years ago

Fixed an issue with the UMAP install caused by a missing comma in the setup.py file. An optional min_count parameter has been added; the default is still 50. All words with a total frequency lower than min_count are ignored by the model.
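The effect of min_count can be illustrated with a minimal sketch. This is not the Top2Vec implementation: the helper name `filter_vocabulary` and the toy corpus are hypothetical, and the sketch only assumes that "total frequency" means the number of occurrences of a word across the whole corpus.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_count=50):
    """Return the words whose total corpus frequency is at least min_count.

    Hypothetical helper for illustration only; it mirrors the rule
    "words with total frequency lower than min_count are ignored".
    """
    # Count every token occurrence across all documents.
    counts = Counter(token for doc in tokenized_docs for token in doc)
    # Keep only words that reach the threshold.
    return {word for word, freq in counts.items() if freq >= min_count}


# Toy corpus (assumption): "topic" appears 3 times, "vector" 2 times,
# "model" and "rare" once each.
docs = [["topic", "model", "vector"], ["topic", "vector"], ["topic", "rare"]]
vocab = filter_vocabulary(docs, min_count=2)
```

With the default of 50, every word in a corpus this small would be dropped, which is why a lower min_count can be useful for small datasets.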

Top2Vec - Hierarchical Topic Reduction

Published by ddangelov over 4 years ago

Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.

Top2Vec - Custom document ids, tokenizer input, option to save documents

Published by ddangelov over 4 years ago

Added an option for custom document ids, which can be strings or ints. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, with positive and negative semantic search.

Top2Vec - Topic size and deduplication

Published by ddangelov over 4 years ago

Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic has been modified to show only documents that have the topic as their nearest topic, in order to avoid overlapping results from similar topics.
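The topic-size definition and the restricted search can be sketched as follows. This is an illustrative sketch, not the library's code: the function names are hypothetical, the vectors are made-up 2-D toy data, and cosine similarity is assumed as the measure of "nearest".

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_topic(doc_vec, topic_vecs):
    """Index of the topic vector most similar to the document vector."""
    return max(range(len(topic_vecs)), key=lambda t: cosine(doc_vec, topic_vecs[t]))


def topic_sizes(doc_vecs, topic_vecs):
    """Topic size = number of documents whose nearest topic is that topic."""
    sizes = [0] * len(topic_vecs)
    for doc_vec in doc_vecs:
        sizes[nearest_topic(doc_vec, topic_vecs)] += 1
    return sizes


def search_by_topic(topic, doc_vecs, topic_vecs):
    """Document indices whose nearest topic is `topic`, most similar first."""
    members = [i for i, d in enumerate(doc_vecs) if nearest_topic(d, topic_vecs) == topic]
    return sorted(members, key=lambda i: cosine(doc_vecs[i], topic_vecs[topic]), reverse=True)


# Toy data (assumption): two topics and four documents in 2-D.
topic_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 1.0]]

sizes = topic_sizes(doc_vecs, topic_vecs)
hits = search_by_topic(0, doc_vecs, topic_vecs)
```

Because each document is assigned to exactly one topic, the sizes sum to the number of documents and no document appears in the search results of two topics.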

Topic deduplication is added to make topics more robust.

Top2Vec - First Release

Published by ddangelov over 4 years ago

Top2Vec initial release.

Package Rankings
Top 25.39% on Conda-forge.org
Top 2.62% on Pypi.org