Use keybert.KeyLLM to leverage LLMs for extracting keywords 🔥

A minimal method for keyword extraction with Large Language Models (LLMs). There are a number of implementations that allow you to mix and match KeyBERT with KeyLLM, and you can also choose to use KeyLLM without KeyBERT.
from keybert import KeyBERT
kw_model = KeyBERT()
# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
If you have embeddings of your documents, you could use those to find documents that are most similar to one another. Those documents could then all receive the same keywords and only one of these documents will need to be passed to the LLM. This can make computation much faster as only a subset of documents will need to receive keywords.
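The grouping idea can be pictured with a small sketch. This is a simplified reconstruction of the concept, not KeyLLM's actual implementation: documents whose embeddings are more similar than the threshold fall into one group, and only the group's representative would need to be passed to the LLM.

```python
import numpy as np

def group_by_similarity(embeddings, threshold=0.75):
    """Greedily group rows whose cosine similarity to a group's
    representative exceeds `threshold`. Returns a list of index lists;
    the first index of each group is the representative that would be
    sent to the LLM."""
    # Normalize rows so dot products equal cosine similarities
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    groups = []
    for i in range(len(unit)):
        for group in groups:
            if unit[i] @ unit[group[0]] >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Four toy "document embeddings": 0/1 and 2/3 are near-duplicates
emb = np.array([
    [1.0, 0.0],
    [0.99, 0.1],
    [0.0, 1.0],
    [0.1, 0.99],
])
print(group_by_similarity(emb, threshold=0.95))  # [[0, 1], [2, 3]]
```

With a higher threshold, fewer documents clear the similarity bar and more groups (and thus more LLM calls) result; a lower threshold merges more documents, matching the behavior described for the threshold parameter above.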
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer
# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents, convert_to_tensor=True)
# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()
# Load it in KeyLLM
kw_model = KeyLLM(llm)
# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.75)
This is the best of both worlds. We use KeyBERT to generate a first pass of keywords and embeddings and give those to KeyLLM for a final pass. Again, the most similar documents will be clustered and they will all receive the same keywords. You can change this behavior with the threshold. A higher value will reduce the number of documents that are clustered and a lower value will increase the number of documents that are clustered.
import openai
from keybert.llm import OpenAI
from keybert import KeyLLM, KeyBERT
# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()
# Load it in KeyLLM
kw_model = KeyBERT(llm=llm)
# Extract keywords
keywords = kw_model.extract_keywords(documents)
See here for the full documentation on use cases of KeyLLM and here for the implemented Large Language Models.
Published by MaartenGr almost 2 years ago
from keybert import KeyBERT
kw_model = KeyBERT()
# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)
# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)
Do note that the parameters passed to .extract_embeddings for creating the vectorizer should be exactly the same as those in .extract_keywords.
Fixed candidates not working (#122)

Published by MaartenGr about 2 years ago
from keybert import KeyBERT
from transformers.pipelines import pipeline
hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)
Use CountVectorizer for creating the tokens.

NOTE: Although highlighting for Chinese texts is improved, since I am not familiar with the Chinese language, there is a good chance it is not yet as optimized as it is for other languages. Any feedback with respect to this is highly appreciated!
Published by MaartenGr over 2 years ago
Use either the default CountVectorizer or the vectorizers from the KeyphraseVectorizers package. More information about the KeyphraseVectorizers package can be found here.
Published by MaartenGr about 3 years ago
Published by MaartenGr over 3 years ago
Use paraphrase-MiniLM-L6-v2 as the default embedding model (great results!)

Highlight a document's keywords with:

keywords = kw_model.extract_keywords(doc, highlight=True)
Published by MaartenGr over 3 years ago
The two main features are candidate keywords and several backends to use instead of Flair and SentenceTransformers!
Highlights:

Use candidate keywords with KeyBERT().extract_keywords(doc, candidates)
Published by MaartenGr over 3 years ago
Published by MaartenGr over 3 years ago
This release is meant as a way to create a DOI through Zenodo.
Published by MaartenGr almost 4 years ago
Added Max Sum Similarity as an option to diversify your results.
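As a rough sketch of the Max Sum Similarity idea (a simplified reconstruction, not KeyBERT's exact code): from the nr_candidates words most similar to the document, select the top_n whose summed pairwise similarity is lowest, i.e. the most diverse subset among the relevant candidates.

```python
import itertools
import numpy as np

def max_sum_distance(doc_emb, cand_embs, candidates, top_n=2, nr_candidates=3):
    """Sketch of Max Sum Similarity: take the `nr_candidates` words most
    similar to the document, then return the `top_n` of them whose summed
    pairwise similarity is lowest (most diverse)."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    doc_sims = [cos(doc_emb, c) for c in cand_embs]
    # Pool of the candidates most similar to the document
    pool = np.argsort(doc_sims)[-nr_candidates:]
    # Among all top_n-subsets of the pool, minimize summed pairwise similarity
    best, best_sum = None, float("inf")
    for combo in itertools.combinations(pool, top_n):
        s = sum(cos(cand_embs[i], cand_embs[j])
                for i, j in itertools.combinations(combo, 2))
        if s < best_sum:
            best, best_sum = combo, s
    return [candidates[i] for i in best]

# Toy candidates: "car" and "vehicle" are near-duplicates
candidates = ["car", "vehicle", "fruit", "rock"]
cand_embs = np.array([[1.0, 0.90, 0.0],
                      [1.0, 0.95, 0.0],
                      [0.0, 1.00, 0.5],
                      [0.2, 0.10, 1.0]])
doc_emb = np.array([1.0, 1.0, 0.0])
print(max_sum_distance(doc_emb, cand_embs, candidates))  # ['fruit', 'car']
```

Only one of the near-duplicate pair "car"/"vehicle" survives, which is exactly the diversification this option provides.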
Published by MaartenGr almost 4 years ago
This first release includes keyword/keyphrase extraction using BERT and simple cosine similarity. There is also an option to use Maximal Marginal Relevance to select the candidate keywords/keyphrases.
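Maximal Marginal Relevance can be sketched as follows (again a simplified reconstruction, not KeyBERT's exact implementation): keywords are picked one at a time, each pick balancing similarity to the document against redundancy with the keywords already selected.

```python
import numpy as np

def mmr(doc_emb, cand_embs, candidates, top_n=2, diversity=0.5):
    """Sketch of Maximal Marginal Relevance: iteratively pick the
    candidate that balances document similarity against similarity
    to already-selected keywords."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    doc_sims = np.array([cos(doc_emb, c) for c in cand_embs])
    selected = [int(np.argmax(doc_sims))]  # start with the best match
    while len(selected) < top_n:
        best_i, best_score = None, -np.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            # Redundancy: closest already-selected keyword
            redundancy = max(cos(cand_embs[i], cand_embs[j]) for j in selected)
            score = (1 - diversity) * doc_sims[i] - diversity * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return [candidates[i] for i in selected]

# Toy candidates: "car" and "vehicle" are near-duplicates
candidates = ["car", "vehicle", "fruit"]
cand_embs = np.array([[1.0, 0.90, 0.0],
                      [1.0, 0.95, 0.0],
                      [0.0, 1.00, 0.5]])
doc_emb = np.array([1.0, 1.0, 0.0])
print(mmr(doc_emb, cand_embs, candidates))  # ['vehicle', 'fruit']
```

A higher diversity weight pushes the selection further away from already-chosen keywords; with diversity set to 0, the sketch degenerates to plain cosine-similarity ranking.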