Some text similarity utilities
MIT License
The goal of akin is to make it easy to sort text based on numeric similarity.
You can install this tool via pip.
python -m pip install "akin @ git+https://github.com/koaning/akin.git"
The simplest way to use this tool is to sort texts.
import pandas as pd
from akin import sort_dataframe
# Let's load in a csv file that has a text column named "text".
dataf = pd.read_csv("data.csv")
# Let's sort this dataframe such that we prefer examples with texts
# that are similar to the examples in the line below.
dataf.pipe(sort_dataframe, examples=["very nice", "super positive"], text_col="text")
In this basic setting, we're really just using CountVectorizer from scikit-learn to compute the similarity between two texts based on bag-of-words counts. We could get a bit fancier though by using word embeddings from whatlies. The library supports any embedding, as long as it's implemented with the scikit-learn API in mind.
from whatlies.language import BytePairLanguage
bp_lang = BytePairLanguage("en")
dataf.pipe(sort_dataframe,
           examples=["very nice", "super positive"],
           text_col="text",
           featurizer=bp_lang)
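To make the bag-of-words idea a bit more concrete, here is a minimal, dependency-free sketch of sorting texts by their similarity to a few examples. It is not how akin is implemented: the helper names (`bag_of_words`, `cosine_similarity`, `sort_by_similarity`), the naive tokenizer, the use of cosine similarity and the choice to score each text by its best-matching example are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (naive tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two word-count Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sort_by_similarity(texts, examples):
    """Sort texts so that those closest to any example come first."""
    example_bags = [bag_of_words(e) for e in examples]

    def best_score(text):
        bag = bag_of_words(text)
        # Assumption: a text is scored by its single best-matching example.
        return max((cosine_similarity(bag, e) for e in example_bags), default=0.0)

    return sorted(texts, key=best_score, reverse=True)


texts = ["the delivery was late", "very nice product", "super positive vibes"]
ranked = sort_by_similarity(texts, examples=["very nice", "super positive"])
print(ranked)
```

Texts that share no words with the examples get a score of zero and end up at the bottom, which is also the main limitation of plain word counts and the reason an embedding-based featurizer can help.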
While the sorting will likely cover most activated labelling use-cases, you
may also want an object that's a bit more flexible. For that you may use
the AkinClassifier.
import pandas as pd
from akin import AkinClassifier
examples = {
    "positive": ["thanks so much", "compliment", "i like this!"],
    "negative": ["this stinks", "you suck"],
}
akin = AkinClassifier(examples=examples)
df = pd.read_csv("<some>/<file>.csv")
# Calculate distances for the original dataframe
akin.assign_distances(df)
# Predict a single item
akin.predict_single(text="thanks, that's nice of you")
# Construct a generator that yields the {text, distances} dictionary for each item
g = akin.pipe(df["text"])
next(g)
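If it helps to picture what an example-based classifier like this could be doing, the sketch below computes a distance from a text to each label's examples and predicts the nearest label. This is a hypothetical stand-in, not the AkinClassifier source: the class name `ToyExampleClassifier`, the bag-of-words cosine distance and the "closest example per label" rule are assumptions. Only the method names `predict_single` and `pipe`, and the `{text, distances}` shape, mirror the snippet above.

```python
import math
import re
from collections import Counter


def _bag(text):
    """Lowercase the text and count its word tokens (naive tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))


def _cosine_distance(a, b):
    """1 - cosine similarity between two word-count Counters."""
    dot = sum(n * b[w] for w, n in a.items())
    norm = math.sqrt(sum(n * n for n in a.values())) * math.sqrt(
        sum(n * n for n in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


class ToyExampleClassifier:
    """Hypothetical illustration; not the real AkinClassifier."""

    def __init__(self, examples):
        # Pre-compute a bag of words for every example of every label.
        self.examples = {
            label: [_bag(t) for t in texts] for label, texts in examples.items()
        }

    def distances(self, text):
        """Distance to the closest example of each label (an assumption)."""
        bag = _bag(text)
        return {
            label: min(_cosine_distance(bag, e) for e in bags)
            for label, bags in self.examples.items()
        }

    def predict_single(self, text):
        """Return the label whose closest example is nearest to the text."""
        dists = self.distances(text)
        return min(dists, key=dists.get)

    def pipe(self, texts):
        """Lazily yield a {text, distances} dictionary for each item."""
        for text in texts:
            yield {"text": text, "distances": self.distances(text)}


examples = {
    "positive": ["thanks so much", "compliment", "i like this!"],
    "negative": ["this stinks", "you suck"],
}
clf = ToyExampleClassifier(examples)
prediction = clf.predict_single("i like this a lot")
first = next(clf.pipe(["this stinks"]))
```

Because `pipe` is a generator, nothing is computed until an item is requested, which keeps it cheap to wire into a larger stream of texts.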
I like to build in public, but I should stress that this repo was made as a utility for myself. Honestly, it was put together in a single evening. Feel free to re-use it, but don't expect long-term maintenance or production-quality code.