Top2Vec

Top2Vec learns jointly embedded topic, document and word vectors.

BSD-3-CLAUSE License

Downloads: 11.9K

Stars: 2.9K

Ecosystems: Python

Top2Vec - hierarchical topic reduction improvements (Latest Release)

Published by ddangelov 11 months ago

Fixed loading bug.
Fixed hierarchical topic reduction bug.
Added parameter for optimizing hierarchical reduction speed.
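
For readers unfamiliar with hierarchical topic reduction, the general idea can be sketched as repeatedly merging the smallest topic into its most similar neighbour until the requested number of topics remains. The code below is a simplified illustration under that assumption, not the Top2Vec implementation: `reduce_topics` is a hypothetical helper that works on plain lists of topic vectors and sizes.

```python
import math


def _cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def reduce_topics(topic_vectors, topic_sizes, num_topics):
    """Merge topics until num_topics remain (hypothetical helper).

    Returns a list of groups; each group lists the original topic
    indices that were merged into one reduced topic.
    """
    if num_topics < 1:
        raise ValueError("num_topics must be at least 1")

    vectors = [list(v) for v in topic_vectors]
    sizes = list(topic_sizes)
    groups = [[i] for i in range(len(vectors))]

    while len(vectors) > num_topics:
        # Pick the smallest topic and its most similar neighbour.
        smallest = min(range(len(sizes)), key=sizes.__getitem__)
        others = [i for i in range(len(vectors)) if i != smallest]
        target = max(others, key=lambda i: _cosine(vectors[smallest], vectors[i]))

        # The merged vector is the size-weighted average of the two.
        total = sizes[smallest] + sizes[target]
        vectors[target] = [
            (a * sizes[target] + b * sizes[smallest]) / total
            for a, b in zip(vectors[target], vectors[smallest])
        ]
        sizes[target] = total
        groups[target].extend(groups[smallest])

        # Drop the topic that was merged away.
        for seq in (vectors, sizes, groups):
            del seq[smallest]

    return groups
```

Each pass rescans every remaining topic, so the cost grows quickly with the number of topics, which is the kind of overhead a speed-tuning parameter would aim to reduce.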

Top2Vec - Topic indexing bugfix

Published by ddangelov 12 months ago

Top2Vec -

Published by ddangelov 12 months ago

Indexing bugfix

Top2Vec - gpu hdbscan and topic indexing

Published by ddangelov 12 months ago

Added GPU hdbscan
Added topic indexing

Top2Vec - gpu umap

Published by ddangelov 12 months ago

Changed default embedding model to universal-sentence-encoder-multilingual.
Added option for GPU umap with gpu_umap parameter.

Top2Vec - Adding compute_topics

Published by ddangelov over 1 year ago

Added a method for computing topics.
Exposed topic deduplication parameter topic_merge_delta.
Bug fixes.

Top2Vec - Sklearn change in API fix

Published by ddangelov over 1 year ago

Replaced scikit-learn's deprecated get_feature_names() with get_feature_names_out().
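
Code that has to run against both older and newer scikit-learn releases can bridge this rename with a small shim. The sketch below is illustrative only: `feature_names` is a hypothetical helper, not part of Top2Vec or scikit-learn, and it assumes nothing more than that the fitted vectorizer exposes one of the two method names.

```python
def feature_names(vectorizer):
    """Return the vocabulary of a fitted vectorizer as a list of strings.

    Hypothetical compatibility helper: prefers the newer
    get_feature_names_out() and falls back to get_feature_names().
    """
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        getter = vectorizer.get_feature_names
    return list(getter())
```

Looking the method up with `getattr` avoids comparing version strings, so the helper keeps working whichever release is installed.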

Top2Vec - Phrases and new embedding options

Published by ddangelov over 2 years ago

New pre-trained transformer models available
Ability to use any embedding model by passing callable to embedding_model
New embedding_batch_size option
Document chunking options for long documents
Phrases in topics by setting ngram_vocab=True
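
To make the document chunking idea concrete, here is a minimal sketch of splitting a long token sequence into fixed-length, overlapping chunks so that each chunk can be embedded separately. The function and its parameter names (`chunk_tokens`, `chunk_length`, `overlap_ratio`, `max_num_chunks`) are illustrative assumptions; the release note does not spell out the actual options.

```python
def chunk_tokens(tokens, chunk_length=100, overlap_ratio=0.5, max_num_chunks=None):
    """Split a token list into overlapping chunks (hypothetical helper)."""
    if chunk_length < 1:
        raise ValueError("chunk_length must be at least 1")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    # Distance between the start positions of consecutive chunks.
    step = max(1, int(chunk_length * (1 - overlap_ratio)))

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_length])
        # Stop once the current chunk reaches the end of the document.
        if start + chunk_length >= len(tokens):
            break
        # Optionally cap the number of chunks per document.
        if max_num_chunks is not None and len(chunks) >= max_num_chunks:
            break
    return chunks
```

A higher overlap ratio yields more chunks per document, which costs more embedding time but makes it less likely that a passage is cut at an awkward boundary.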

Top2Vec - Query documents and topics fix

Published by ddangelov over 3 years ago

Top2Vec - Query documents and topics

Published by ddangelov over 3 years ago

Added query_documents and query_topics methods which allow for using a sequence of text such as a question, a sentence, a paragraph or a document to query documents or topics.

Added num_topics parameter to get_documents_topics method which allows retrieving multiple topics per document.
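
A rough usage sketch of these additions follows. It assumes an already trained model is passed in, and that the methods return their results in the order used here (documents, scores, ids for query_documents; words, word scores, topic scores, topic numbers for query_topics; topic numbers first for get_documents_topics). That ordering and the `num_docs` argument are assumptions not stated in the note, so verify them against the installed version. `answer_query` itself is a hypothetical helper.

```python
def answer_query(model, question, num_docs=5, num_topics=3):
    """Collect documents and topics related to a free-text query.

    `model` is assumed to be a trained Top2Vec model (or any object
    exposing the same three methods).
    """
    # Documents most similar to the query text.
    documents, doc_scores, doc_ids = model.query_documents(question, num_docs)

    # Topics most similar to the query text.
    _, _, topic_scores, topic_nums = model.query_topics(question, num_topics)

    # Several topics per returned document, via the num_topics parameter.
    doc_topic_nums, _, _, _ = model.get_documents_topics(
        list(doc_ids), num_topics=num_topics
    )

    return {
        "documents": list(zip(doc_ids, documents, doc_scores)),
        "topics": list(zip(topic_nums, topic_scores)),
        "topics_per_document": dict(zip(doc_ids, doc_topic_nums)),
    }
```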

Top2Vec - gensim version fix

Published by ddangelov over 3 years ago

Fixes #152

Top2Vec -

Published by ddangelov over 3 years ago

Added numpy>=1.20.0 dependency.

Top2Vec -

Published by ddangelov over 3 years ago

Numpy related bug fix and document id validation performance upgrade.

Top2Vec - added umap/hdbscan custom args

Published by ddangelov over 3 years ago

Addressed #90, #125, #126

Added custom umap and hdbscan arg option. Fixed issue with loading model with custom tokenizer.
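
As an illustration of how custom arguments might be prepared, the sketch below starts from a set of baseline values and overrides only the ones that need to change. The parameter names `umap_args` and `hdbscan_args`, and the baseline values shown, are assumptions not stated in this release note; check them against the Top2Vec documentation before relying on them. `with_overrides` is a hypothetical helper.

```python
# Assumed baseline values; confirm against the Top2Vec documentation.
DEFAULT_UMAP_ARGS = {"n_neighbors": 15, "n_components": 5, "metric": "cosine"}
DEFAULT_HDBSCAN_ARGS = {
    "min_cluster_size": 15,
    "metric": "euclidean",
    "cluster_selection_method": "eom",
}


def with_overrides(defaults, **overrides):
    """Return a copy of `defaults` with selected keys replaced."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Larger neighbourhoods and clusters tend to give fewer, broader topics.
umap_args = with_overrides(DEFAULT_UMAP_ARGS, n_neighbors=30)
hdbscan_args = with_overrides(DEFAULT_HDBSCAN_ARGS, min_cluster_size=50)

# Assumed usage, left commented out because it needs a corpus and training:
# model = Top2Vec(documents, umap_args=umap_args, hdbscan_args=hdbscan_args)
```

Starting from a full set of values guards against silently dropping a setting if the library treats a custom dictionary as a complete replacement rather than a partial override.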

Top2Vec - added use_embedding_model_tokenizer option

Published by ddangelov almost 4 years ago

Added use_embedding_model_tokenizer parameter. If set to True and if using an embedding_model other than doc2vec, use the model's tokenizer for document embedding.

Fixed dependency issue with joblib.

Fixed issues with wordclouds caused by negative similarity scores.
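
The note does not say how the wordcloud issue was resolved. One common way to turn similarity scores that may be negative into usable word weights is to pass them through a softmax, which keeps their order but makes every weight strictly positive. The sketch below shows that approach under this assumption; `wordcloud_weights` is a hypothetical helper.

```python
import math


def softmax(scores):
    """Map arbitrary real scores to positive weights that sum to 1."""
    if not scores:
        return []
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def wordcloud_weights(words, scores):
    """Pair each word with a positive weight derived from its score."""
    return dict(zip(words, softmax(list(scores))))
```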

Top2Vec - fix saving bug

Published by ddangelov almost 4 years ago

Fixed bug #91

Top2Vec - word indexing

Published by ddangelov almost 4 years ago

Added option for indexing word vectors, which speeds up search for models with large vocabularies, specifically search_words_by_vector and similar_words.

Added new method search_words_by_vector.

Top2Vec - document indexing

Published by ddangelov almost 4 years ago

Added option for indexing document vectors, which speeds up search for models with a large number of documents, specifically search_documents_by_vector, search_documents_by_keywords, and search_documents_by_documents.

Added new method search_documents_by_vector.

Added code to prevent hierarchical topic reduction error #79.

Top2Vec - Separate dependencies

Published by ddangelov almost 4 years ago

Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional.
Install them with pip install top2vec[sentence_encoders] and pip install top2vec[sentence_transformers], respectively.

Faster cosine similarity.
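
The note does not describe how the speed-up was achieved. A common technique, sketched here as an assumption rather than as the library's method, is to normalize all stored vectors once so that every later similarity lookup is a plain dot product. The helper names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        return list(vector)
    return [x / norm for x in vector]


def build_index(vectors):
    """Normalize every stored vector once, up front."""
    return [normalize(v) for v in vectors]


def most_similar(index, query, top_n=3):
    """Return (row, cosine similarity) pairs for the closest stored vectors."""
    q = normalize(query)
    # With unit-length vectors, cosine similarity is just the dot product.
    scored = [
        (sum(a * b for a, b in zip(row, q)), i) for i, row in enumerate(index)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:top_n]]
```

In practice the same idea is usually expressed as a single matrix-vector product over an array of pre-normalized vectors.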

Top2Vec - logging bug fix and default change

Published by ddangelov about 4 years ago

The verbose parameter is now set to True by default.

Fixed a bug that stopped showing logging updates after downloading pre-trained models.

Package Rankings

Top 25.39% on Conda-forge.org

Top 2.62% on Pypi.org

Related Projects

ctrl

Conditional Transformer Language Model for Controllable Generation

29 Aug 2019 1,866

DocumentSearchEngine

Document Search Engine project with TF-IDF and Google universal sentence encoder model

Article-Summarizer

Uses frequency analysis to summarize text.

04 Jan 2017 183

Data-science

Collection of useful data science topics along with articles, videos, and code

17 Jul 2020 4,031

magnitude

A fast, efficient universal vector embedding utility package.

24 Feb 2018 1,624

subreddit-analyzer

A comprehensive Data and Text Mining workflow for submissions and comments from any given public ...

17 Dec 2019 489

leeky

leeky - training data contamination techniques for blackbox models

Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

02 Jul 2021 1,323

DAT8

General Assembly's 2015 Data Science course in Washington, DC

07 Aug 2015 1,606

summarizer

A Reddit bot that summarizes news articles written in Spanish or English. It uses a custom built ...

10 Feb 2019 269

ir-using-kg

Keyphrase Generation for Scientific Document Retrieval

vid2cleantxt

Python API & command-line tool to easily transcribe speech-based video files into clean text

09 Mar 2021 186

sense2vec

🦆 Contextually-keyed word vectors

23 Jan 2016 1,617

nlp

This repository recorded my NLP journey.

18 May 2018 1,073

BERTopic

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

22 Sep 2020 5,670