NLP Cheat Sheet, Python, spaCy, LexNLP, NLTK, tokenization, stemming, sentence detection, named entity recognition
Introduction to Natural Language Processing (NLP) tools, frameworks, concepts, resources for Python
An uncased model is usually better, unless you know that case information is important for your task (e.g., Named Entity Recognition or Part-of-Speech tagging).
Example classification tasks:

- Assigning a label (positive, neutral, negative) to a product review
- Assigning a label (bug, feature request) to a customer support ticket

Example corpus import:

from nltk.corpus import gutenberg

Evaluate a model's performance on language understanding tasks, such as question answering and text summarization.
spaCy (good for beginners; use NLTK for bigger projects)
pip install spacy
python -m spacy download en_core_web_sm
# python -m spacy download en_core_web_lg
LexNLP (good for dealing with legal and financial documents; see the installation guide)
pip install https://github.com/LexPredict/lexpredict-lexnlp/archive/master.zip
python # to open REPL console
>>> import nltk
>>> nltk.download() # download all packages
Visualizing word vectors using PCA.
Example: one-hot encoded vectors

In a six-word vocabulary, the word numbers is represented as the one-hot encoded vector [0,0,0,0,0,1], and converted as [0,0,0,1,0,0].

Example: word embeddings
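A minimal sketch of one-hot encoding; the six-word vocabulary here is an assumption for illustration:

```python
# Hypothetical vocabulary; index positions are arbitrary
vocabulary = ["all", "text", "gets", "converted", "to", "numbers"]

def one_hot(word):
    # 1 at the word's vocabulary index, 0 everywhere else
    return [1 if word == w else 0 for w in vocabulary]

print(one_hot("numbers"))    # [0, 0, 0, 0, 0, 1]
print(one_hot("converted"))  # [0, 0, 0, 1, 0, 0]
```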
Credits: Deep Learning with Python
There are two ways to obtain word embeddings:
1. Learn word embeddings jointly with the main task (e.g., train an Embedding layer from scratch)
2. Load pretrained word embeddings into the model

When parts of a model are pretrained (like your Embedding layer) and parts are randomly initialized (like your classifier), the pretrained parts shouldn't be updated during training, to avoid forgetting what they already know.
How OpenAI GPT-3 embeddings compare to Google and Sentence-Transformer embeddings.
SentenceTransformers is a Python framework for state-of-the-art sentence, text and image embeddings. The initial work is described in the paper Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
Install it with:
pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Sentences to encode
sentences = [
"This framework generates embeddings for each input sentence.",
"Sentences are passed as a list of string.",
"The quick brown fox jumps over the lazy dog."
]
embeddings = model.encode(sentences)
for sentence, embedding in zip(sentences, embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding[:15])
    print("Embedding dimension", embedding.shape)
    print("")
Sentence: This framework generates embeddings for each input sentence.
Embedding: [-0.01195314 -0.05562933 -0.00824256 0.00889048 0.02768425 0.1139881
0.01469875 -0.03189586 0.04145184 -0.08188552 0.01413268 -0.0203336
0.04077511 0.02262853 -0.04784386]
Embedding dimension (384,)
Sentence: Sentences are passed as a list of string.
Embedding: [ 0.0564525 0.05500239 0.03137959 0.03394853 -0.03542475 0.08346675
0.09888012 0.00727544 -0.00668658 -0.0076581 0.07937384 0.00073965
0.01492921 -0.01510471 0.03676743]
Embedding dimension (384,)
Sentence: The quick brown fox jumps over the lazy dog.
Embedding: [ 0.04393354 0.05893442 0.04817837 0.07754811 0.02674442 -0.03762956
-0.0026051 -0.05994309 -0.002496 0.02207284 0.04802594 0.05575529
-0.03894543 -0.0266168 0.0076934 ]
Embedding dimension (384,)
Fine-tuning pre-trained GloVe embeddings with Keras.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense, LSTM

# download from https://nlp.stanford.edu/projects/glove
f = open('./glove.6B.100d.txt')
embeddings_index = {}
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Found %s word vectors.' % len(embeddings_index))

embedding_dim = 100
maxlen = 100
training_samples = 200
validation_samples = 10000
max_words = 10000

# word_index comes from a Tokenizer fitted on the training texts (not shown here)
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
# option A) Flatten + Dense
model.add(Flatten())
model.add(Dense(32, activation='relu'))
# option B) LSTM
# model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# Load the pretrained GloVe vectors into the Embedding layer and freeze it
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_data=(x_val, y_val))
model.save_weights('pre_trained_glove_model.h5')
You can also train the same model without loading the pretrained word embeddings and without freezing the embedding layer. In that case, you'll learn a task-specific embedding of the input tokens, which is generally more powerful than pretrained word embeddings when lots of data is available.
The Universal Sentence Encoder (Cer et al., 2018) (USE) is a model that encodes text into 512-dimensional embeddings.
import tensorflow as tf
import tensorflow_hub as hub
import numpy as np
module_url = "https://tfhub.dev/google/universal-sentence-encoder/4"
model = hub.load(module_url)
def embed(input):
    return model(input)
word = "Elephant"
sentence = "I am a sentence for which I would like to get its embedding."
paragraph = (
"Universal Sentence Encoder embeddings also support short paragraphs. "
"There is no hard limit on how long the paragraph is. Roughly, the longer "
"the more 'diluted' the embedding will be.")
messages = [word, sentence, paragraph]
message_embeddings = embed(messages)
for i, message_embedding in enumerate(np.array(message_embeddings).tolist()):
    print("Message: {}".format(messages[i]))
    print("Embedding size: {}".format(len(message_embedding)))
    message_embedding_snippet = ", ".join(
        (str(x) for x in message_embedding[:3]))
    print("Embedding: [{}, ...]\n".format(message_embedding_snippet))
Message: Elephant
Embedding size: 512
Embedding: [0.008344486355781555, 0.00048085825983434916, 0.06595248728990555, ...]
Message: I am a sentence for which I would like to get its embedding.
Embedding size: 512
Embedding: [0.050808604806661606, -0.016524329781532288, 0.01573779620230198, ...]
Message: Universal Sentence Encoder embeddings also support short paragraphs. There is no hard limit on how long the paragraph is. Roughly, the longer the more 'diluted' the embedding will be.
Embedding size: 512
Embedding: [-0.028332693502306938, -0.0558621808886528, -0.012941480614244938, ...]
import gensim
# sentence_list: a list of tokenized sentences (each a list of tokens)
word2vec_model = gensim.models.word2vec.Word2Vec(sentence_list)
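A minimal usage sketch; the toy corpus below stands in for sentence_list, which must be a list of token lists:

```python
import gensim

# Toy corpus: each sentence is a list of tokens
sentence_list = [
    ["dog", "bites", "man"],
    ["man", "bites", "dog"],
    ["dog", "eats", "meat"],
    ["man", "eats", "food"],
]

# min_count=1 keeps rare words, which the default would drop in this tiny corpus
model = gensim.models.word2vec.Word2Vec(sentence_list, min_count=1)

print(model.wv["dog"].shape)         # vector for 'dog', e.g. (100,)
print(model.wv.most_similar("dog"))  # nearest neighbours by cosine similarity
```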
Term Frequency - Inverse Document Frequency. The goal is to determine a TF-IDF vector for each document. The vectors are then used to calculate the similarity between documents.
Common terms (e.g., a, the, is) appear in many documents and therefore have an IDF close to zero.

The TF is calculated for a term t in a document d. Hence, every term in every document has a TF, and we need to calculate TFs for every term in every document.

The IDF score is calculated once for each term t occurring in the corpus.

Combining these two, we get the TF-IDF score w for a term t in a document d:

w(t, d) = TF(t, d) * IDF(t) = TF(t, d) * log(N / df(t))

scikit-learn (sklearn) in Python has a function TfidfVectorizer() that will compute the TF-IDF values for you.
The number of documents in our corpus is N=4.

The 6 unique terms in our corpus are dog, bites, man, eats, meat, food.

Let's determine the TF-IDF scores w for all terms in the first document, Dog bites man (the IDF here uses log base 2, which matches the scores below).
Terms | TF | IDF Score | w = TF * IDF |
---|---|---|---|
dog | 1/3 | log(4/3) | 0.138 |
bites | 1/3 | log(4/2) | 0.33 |
man | 1/3 | log(4/3) | 0.138 |
eats | 0 | log(4/2) | 0 |
meat | 0 | log(4/1) | 0 |
food | 0 | log(4/1) | 0 |
The corresponding TF-IDF vector for document D1 is:
[0.138, 0.33, 0.138, 0, 0, 0]
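The table above can be reproduced in a few lines of Python (note the base-2 logarithm, which matches the scores shown):

```python
import math

corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
N = len(corpus)
doc = corpus[0].split()  # "dog bites man"

for term in ["dog", "bites", "man", "eats", "meat", "food"]:
    tf = doc.count(term) / len(doc)                 # term frequency in D1
    df = sum(term in d.split() for d in corpus)     # document frequency
    idf = math.log2(N / df)
    print(term, round(tf * idf, 3))
# dog 0.138, bites 0.333, man 0.138, eats 0.0, meat 0.0, food 0.0
```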
Consider a document containing 100 words wherein the word cat appears 3 times. The term frequency (TF) for cat is then (3 / 100) = 0.03.
Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (IDF) is calculated as log10(10,000,000 / 1,000) = 4.
Thus, the TF-IDF weight is the product of these quantities: 0.03 * 4 = 0.12.
from sklearn.feature_extraction.text import TfidfVectorizer
document_corpus = [
"Dog bites man",
"Man bites dog",
"Dog eats meat",
"Man eats food"
]
tfidf = TfidfVectorizer()
bow_rep_tfidf = tfidf.fit_transform(document_corpus)
print("IDF for all words in the vocabulary")
print(tfidf.idf_)
print("\nAll words in the vocabulary.")
print(tfidf.get_feature_names_out())
temp = tfidf.transform(["Dog bites man"])
print("\nTF-IDF representation for 'Dog bites man':\n", temp.toarray())
IDF for all words in the vocabulary
[1.51082562 1.22314355 1.51082562 1.91629073 1.22314355 1.91629073]
All words in the vocabulary.
['bites' 'dog' 'eats' 'food' 'man' 'meat']
TF-IDF representation for 'Dog bites man':
[[0.65782931 0.53256952 0. 0. 0.53256952 0. ]]
Notice that the TF-IDF scores we calculated for our corpus don't match the TF-IDF scores given by scikit-learn. This is because scikit-learn uses a slightly modified version of the IDF formula. This stems from provisions to account for possible zero divisions and to not entirely ignore terms that appear in all documents.
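With the default smooth_idf=True, scikit-learn computes IDF as ln((1 + n) / (1 + df)) + 1; a quick check reproduces the idf_ values above:

```python
import numpy as np

n = 4  # number of documents in our corpus
for df in [3, 2, 1]:  # document frequencies occurring in the corpus
    print(df, np.log((1 + n) / (1 + df)) + 1)
# df=3 -> 1.2231..., df=2 -> 1.5108..., df=1 -> 1.9162...
```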
TF-IDF with N-grams
from sklearn.feature_extraction.text import TfidfVectorizer
import re
document_corpus = [
"Dog bites man",
"Man bites dog",
"Dog eats meat",
"Man eats food"
]
# Write a function for cleaning strings and returning an array of ngrams
def ngrams_analyzer(string):
    string = re.sub(r'[,-./]', r'', string)
    ngrams = zip(*[string[i:] for i in range(5)])  # N-gram length is 5
    return [''.join(ngram) for ngram in ngrams]
# Construct your vectorizer for building the TF-IDF matrix
tfidf = TfidfVectorizer(analyzer=ngrams_analyzer)
bow_rep_tfidf = tfidf.fit_transform(document_corpus)
print("IDF for all words in the vocabulary")
print(tfidf.idf_)
print("\nAll words in the vocabulary.")
print(tfidf.get_feature_names_out())
temp = tfidf.transform(["Dog bites man"])
print("\nTF-IDF representation for 'Dog bites man':\n", temp.toarray())
# Credits: https://towardsdatascience.com/group-thousands-of-similar-spreadsheet-text-cells-in-seconds-2493b3ce6d8d
IDF for all words in the vocabulary
[1.51082562 1.51082562 1.91629073 1.91629073 1.91629073 1.91629073
1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.91629073
1.51082562 1.51082562 1.91629073 1.91629073 1.91629073 1.91629073
1.51082562 1.91629073 1.91629073 1.91629073 1.91629073 1.91629073
1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.91629073
1.91629073]
All words in the vocabulary.
[' bite' ' eats' ' food' ' meat' 'Dog b' 'Dog e' 'Man b' 'Man e' 'an bi'
'an ea' 'ats f' 'ats m' 'bites' 'eats ' 'es do' 'es ma' 'g bit' 'g eat'
'ites ' 'n bit' 'n eat' 'og bi' 'og ea' 's dog' 's foo' 's man' 's mea'
'tes d' 'tes m' 'ts fo' 'ts me']
TF-IDF representation for 'Dog bites man':
[[0.28113163 0. 0. 0. 0.35657982 0.
0. 0. 0. 0. 0. 0.
0.28113163 0. 0. 0.35657982 0.35657982 0.
0.28113163 0. 0. 0.35657982 0. 0.
0. 0.35657982 0. 0. 0.35657982 0.
0. ]]
Words that are similar to each other will tend to co-occur together.
Let's call the context of a word the two words that surround it on each side. For example, in the sentence I ate a peach yesterday, the word peach is surrounded by the words ate, a, yesterday.
To build a co-occurrence matrix, one has to start with the full vocabulary of words in a specific corpus.
Example
Let's imagine some simple sentences:

The vocabulary of our group of sentences is:

a, ate, banana, beach, car, in, is, I'm, jeep, my, riding, to, the

Our co-occurrence vector will be of size 13, where 13 is the number of distinct words in our vocabulary.

The initialized co-occurrence vector for the word car is:

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
In our sentences, the word car shows up in 3 sentences. The words that co-occur with car are the two words before and the two words after each occurrence of car.

The co-occurrence vector for car is:

# Vocabulary + co-occurrence vector below
a, ate, banana, beach, car, in, is, I'm, jeep, my, riding, to, the
[2, 0, 0, 0, 0, 1, 2, 0, 0, 2, 0, 1, 1]

Each number represents the number of occurrences in the context of the word car. For example, a appears twice in car's context, whereas ate doesn't appear at all.
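A minimal sketch of building such a co-occurrence matrix with a window of two words on each side (the toy sentences are made up for illustration):

```python
from collections import defaultdict

sentences = [
    ["i", "am", "riding", "in", "my", "car"],
    ["my", "car", "is", "a", "jeep"],
]
window = 2

cooc = defaultdict(lambda: defaultdict(int))
for sentence in sentences:
    for i, word in enumerate(sentence):
        # two words before and two words after the current word
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                cooc[word][sentence[j]] += 1

print(dict(cooc["car"]))  # context counts for 'car'
```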
Skip-gram follows the same topology as CBOW; it just flips CBOW's architecture on its head. The aim of skip-gram is to predict the context given a word.
# John likes to watch movies. Mary likes movies too.
BoW1 = {"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
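The same bag-of-words counts can be computed with scikit-learn's CountVectorizer (note that it lowercases by default):

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(["John likes to watch movies. Mary likes movies too."])

print(vectorizer.get_feature_names_out())
print(bow.toarray())
# ['john' 'likes' 'mary' 'movies' 'to' 'too' 'watch'] and [[1 2 1 2 1 1 1]]
```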
import spacy
# Load the small English pipeline; download it first (see above)
nlp = spacy.load("en_core_web_sm")
# Import large dataset. Needs to be downloaded first.
# nlp = spacy.load("en_core_web_lg")
Stop words are the very common words like "if", "but", "we", "he", "she", and "they". We can usually remove these words without changing the semantics of a text, and doing so often (but not always) improves the performance of a model.
# spacy: Removing stop words
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
print('spacy: Number of stop words: %d' % len(spacy_stopwords))
spacy: Number of stop words: 326
# nltk: Removing stop words
from nltk.corpus import stopwords
english_stop_words = stopwords.words('english')
print('nltk: Number of stop words: %d' % len(english_stop_words))
nltk: Number of stop words: 179
text = 'Larry Page founded Google in early 1990.'
doc = nlp(text)
tokens = [token.text for token in doc if not token.is_stop]
print('Original text: %s' % (text))
print()
print(tokens)
Original text: Larry Page founded Google in early 1990.
['Larry', 'Page', 'founded', 'Google', 'early', '1990', '.']
Part of a given text. So doc[2:4] is a span starting at token 2, up to (but not including) token 4.
Docs: https://spacy.io/api/span
doc = nlp("Larry Page founded Google in early 1990.")
span = doc[2:4]
span.text
'founded Google'
[token for token in doc]
[Larry, Page, founded, Google, in, early, 1990, .]
Segmenting text into words, punctuation etc.
Docs: https://spacy.io/api/token
doc = nlp("Larry Page founded Google in early 1990.")
[token.text for token in doc]
['Larry', 'Page', 'founded', 'Google', 'in', 'early', '1990', '.']
# Load OpenAI GPT-2 using PyTorch Transformers
from pytorch_transformers import GPT2Tokenizer, GPT2Model
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')
# https://huggingface.co/pytorch-transformers/serialization.html
SentencePiece is a tokenization algorithm for the preprocessing of text.
Unicode normalization (such as NFC or NFKC) can also be applied by the tokenizer.
HTML tokenizers
Segments and labels multi-token sequences.
Credits: https://www.nltk.org/book/ch07.html
A chink is a sequence of tokens that is not included in a chunk.
Credits: https://www.nltk.org/book/ch07.html
Assigning word types to tokens like verb or noun.
POS tagging should be done straight after tokenization and before any words are removed so that sentence structure is preserved and it is more obvious what part of speech the word belongs to.
text = "Asian shares skidded on Tuesday after a rout in tech stocks put Wall Street to the sword"
doc = nlp(text)
[(token.orth_, token.pos_, spacy.explain(token.pos_)) for token in doc]
[('Asian', 'ADJ', 'adjective'),
('shares', 'NOUN', 'noun'),
('skidded', 'VERB', 'verb'),
('on', 'ADP', 'adposition'),
('Tuesday', 'PROPN', 'proper noun'),
('after', 'ADP', 'adposition'),
('a', 'DET', 'determiner'),
('rout', 'NOUN', 'noun'),
('in', 'ADP', 'adposition'),
('tech', 'NOUN', 'noun'),
('stocks', 'NOUN', 'noun'),
('put', 'VERB', 'verb'),
('Wall', 'PROPN', 'proper noun'),
('Street', 'PROPN', 'proper noun'),
('to', 'ADP', 'adposition'),
('the', 'DET', 'determiner'),
('sword', 'NOUN', 'noun')]
[(token.orth_, token.tag_, spacy.explain(token.tag_)) for token in doc]
[('Asian', 'JJ', 'adjective'),
('shares', 'NNS', 'noun, plural'),
('skidded', 'VBD', 'verb, past tense'),
('on', 'IN', 'conjunction, subordinating or preposition'),
('Tuesday', 'NNP', 'noun, proper singular'),
('after', 'IN', 'conjunction, subordinating or preposition'),
('a', 'DT', 'determiner'),
('rout', 'NN', 'noun, singular or mass'),
('in', 'IN', 'conjunction, subordinating or preposition'),
('tech', 'NN', 'noun, singular or mass'),
('stocks', 'NNS', 'noun, plural'),
('put', 'VBD', 'verb, past tense'),
('Wall', 'NNP', 'noun, proper singular'),
('Street', 'NNP', 'noun, proper singular'),
('to', 'IN', 'conjunction, subordinating or preposition'),
('the', 'DT', 'determiner'),
('sword', 'NN', 'noun, singular or mass')]
# using nltk
import nltk
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = nltk.word_tokenize(text)
pos_tags = nltk.pos_tag(tokens)
pos_tags
[('Asian', 'JJ'),
('shares', 'NNS'),
('skidded', 'VBN'),
('on', 'IN'),
('Tuesday', 'NNP'),
('after', 'IN'),
('a', 'DT'),
('rout', 'NN'),
('in', 'IN'),
('tech', 'JJ'),
('stocks', 'NNS'),
('put', 'VBD'),
('Wall', 'NNP'),
('Street', 'NNP'),
('to', 'TO'),
('the', 'DT'),
('sword', 'NN')]
[(token, token.ent_iob_, token.ent_type_) for token in doc]
[(Asian, 'B', 'NORP'),
(shares, 'O', ''),
(skidded, 'O', ''),
(on, 'O', ''),
(Tuesday, 'B', 'DATE'),
(after, 'O', ''),
(a, 'O', ''),
(rout, 'O', ''),
(in, 'O', ''),
(tech, 'O', ''),
(stocks, 'O', ''),
(put, 'O', ''),
(Wall, 'O', ''),
(Street, 'O', ''),
(to, 'O', ''),
(the, 'O', ''),
(sword, 'O', '')]
Stemming is the process of reducing words to their root form.
Examples:
There are two types of stemmers in NLTK: the Porter stemmer and the Snowball stemmer.
import nltk
from nltk.stem.porter import *
stemmer = PorterStemmer()
tokens = ['compute', 'computer', 'computed', 'computing']
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))
compute --> comput
computer --> comput
computed --> comput
computing --> comput
Assigning the base form of a word, for example:
doc = nlp("Was Google founded in early 1990?")
[(token.orth_, token.lemma_) for token in doc]
[('Was', 'be'),
('Google', 'Google'),
('founded', 'found'),
('in', 'in'),
('early', 'early'),
('1990', '1990'),
('?', '?')]
Finding and segmenting individual sentences.
doc = nlp("Larry Page founded Google in early 1990. Sergey Brin joined.")
[sent.text for sent in doc.sents]
['Larry Page founded Google in early 1990.', 'Sergey Brin joined.']
Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.
doc = nlp("We are reading a text.")
# Dependency labels
[(token.orth_, token.dep_, spacy.explain(token.dep_)) for token in doc]
[('We', 'nsubj', 'nominal subject'),
('are', 'aux', 'auxiliary'),
('reading', 'ROOT', None),
('a', 'det', 'determiner'),
('text', 'dobj', 'direct object'),
('.', 'punct', 'punctuation')]
# Syntactic head token (governor)
[token.head.text for token in doc]
['reading', 'reading', 'reading', 'text', 'reading', 'reading']
doc = nlp("I have a red car")
[chunk.text for chunk in doc.noun_chunks]
['I', 'a red car']
What is NER? Labeling "real-world" objects, like persons, companies or locations.
2 popular approaches:
Datasets:
Credits: https://medium.com/@yingbiao/ner-with-bert-in-action-936ff275bc73
Entities supported by spacy:
LexNLP entities:
Stanford NER entities:
NLTK
Transformer Models (on HuggingFace)
doc = nlp("Larry Page founded Google in the US in early 1990.")
# Text and label of named entity span
[(ent.text, ent.label_) for ent in doc.ents]
[('Larry Page', 'PERSON'),
('Google', 'ORG'),
('US', 'GPE'),
('early 1990', 'DATE')]
doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices')
[(X.text, X.label_) for X in doc.ents]
[('European', 'NORP'),
('Google', 'ORG'),
('$5.1 billion', 'MONEY'),
('Wednesday', 'DATE')]
from collections import Counter
labels = [x.label_ for x in doc.ents]
Counter(labels)
Counter({'NORP': 1, 'ORG': 1, 'MONEY': 1, 'DATE': 1})
[(X, X.ent_iob_, X.ent_type_) for X in doc]
[(European, 'B', 'NORP'),
(authorities, 'O', ''),
(fined, 'O', ''),
(Google, 'B', 'ORG'),
(a, 'O', ''),
(record, 'O', ''),
($, 'B', 'MONEY'),
(5.1, 'I', 'MONEY'),
(billion, 'I', 'MONEY'),
(on, 'O', ''),
(Wednesday, 'B', 'DATE'),
(for, 'O', ''),
(abusing, 'O', ''),
(its, 'O', ''),
(power, 'O', ''),
(in, 'O', ''),
(the, 'O', ''),
(mobile, 'O', ''),
(phone, 'O', ''),
(market, 'O', ''),
(and, 'O', ''),
(ordered, 'O', ''),
(the, 'O', ''),
(company, 'O', ''),
(to, 'O', ''),
(alter, 'O', ''),
(its, 'O', ''),
(practices, 'O', '')]
# Show Begin and In entities
items = [x.text for x in doc.ents]
print(items)
Counter(items).most_common(3)
['European', 'Google', '$5.1 billion', 'Wednesday']
[('European', 1), ('Google', 1), ('$5.1 billion', 1)]
import lexnlp.extract.en.amounts
import nltk
text = "There are ten cows in the 2 acre pasture."
print(list(lexnlp.extract.en.amounts.get_amounts(text)))
[10, 2.0]
import lexnlp.extract.en.acts
text = "test section 12 of the VERY Important Act of 1954."
lexnlp.extract.en.acts.get_act_list(text)
[{'location_start': 5,
'location_end': 49,
'act_name': 'VERY Important Act',
'section': '12',
'year': '1954',
'ambiguous': False,
'value': 'section 12 of the VERY Important Act of 1954'}]
Source: https://huggingface.co/flair/ner-english-large
Available tags:
pip install flair
from flair.data import Sentence
from flair.models import SequenceTagger
# load tagger
tagger = SequenceTagger.load("flair/ner-english-large")
# make example sentence
sentence = Sentence("George Washington went to Washington")
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('The following NER tags are found:')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
print(entity)
Sentence: "George Washington went to Washington" â ["George Washington"/PER, "Washington"/LOC]
The following NER tags are found:
Span[0:2]: "George Washington" → PER (1.0)
Span[4:5]: "Washington" → LOC (1.0)
text = "We are the platform of choice for customers' SAP workloads in the cloud, companies like Thabani, Munich Re's, Sodexo, Volvo Cars, all run SAP on Azure. We are the only cloud provider with direct and secure access to Oracle databases running an Oracle Cloud infrastructure, making it possible for companies like FedEx, GE, and Marriott to use capabilities from both companies. And with Azure Confidential Computing, we're enabling companies in highly regulated industries, including RBC, to bring their most sensitive applications to the cloud. Just last week, UBS said it will move more than 50% of its applications to Azure."
# make example sentence
sentence = Sentence(text)
# predict NER tags
tagger.predict(sentence)
# print sentence
print(sentence)
# print predicted NER spans
print('\nThe following NER tags are found:\n')
# iterate over entities and print
for entity in sentence.get_spans('ner'):
print(entity)
Sentence: "We are the platform of choice for customers' SAP workloads in the cloud , companies like Thabani , Munich Re 's , Sodexo , Volvo Cars , all run SAP on Azure . We are the only cloud provider with direct and secure access to Oracle databases running an Oracle Cloud infrastructure , making it possible for companies like FedEx , GE , and Marriott to use capabilities from both companies . And with Azure Confidential Computing , we 're enabling companies in highly regulated industries , including RBC , to bring their most sensitive applications to the cloud . Just last week , UBS said it will move more than 50 % of its applications to Azure ." â ["SAP"/ORG, "Thabani"/ORG, "Munich Re"/ORG, "Sodexo"/ORG, "Volvo Cars"/ORG, "SAP"/ORG, "Azure"/MISC, "Oracle"/ORG, "Oracle Cloud"/MISC, "FedEx"/ORG, "GE"/ORG, "Marriott"/ORG, "Azure Confidential Computing"/MISC, "RBC"/ORG, "UBS"/ORG, "Azure"/MISC]
The following NER tags are found:
Span[8:9]: "SAP" → ORG (0.9945)
Span[16:17]: "Thabani" → ORG (1.0)
Span[18:20]: "Munich Re" → ORG (0.9604)
Span[22:23]: "Sodexo" → ORG (1.0)
Span[24:26]: "Volvo Cars" → ORG (1.0)
Span[29:30]: "SAP" → ORG (0.9995)
Span[31:32]: "Azure" → MISC (0.9974)
Span[45:46]: "Oracle" → ORG (0.9997)
Span[49:51]: "Oracle Cloud" → MISC (1.0)
Span[59:60]: "FedEx" → ORG (1.0)
Span[61:62]: "GE" → ORG (1.0)
Span[64:65]: "Marriott" → ORG (1.0)
Span[74:77]: "Azure Confidential Computing" → MISC (0.999)
Span[88:89]: "RBC" → ORG (1.0)
Span[104:105]: "UBS" → ORG (1.0)
Span[117:118]: "Azure" → MISC (0.9993)
Two types:
Assigning categories or labels to a whole document, or parts of a document.
Approach:
# Credits: https://pytorch.org/tutorials/beginner/text_sentiment_ngrams_tutorial.html
import re
import torch
from torchtext.data.utils import ngrams_iterator
from torchtext.data.utils import get_tokenizer

ag_news_label = {1: "World",
                 2: "Sports",
                 3: "Business",
                 4: "Sci/Tec"}

def predict(text, model, vocab, ngrams):
    tokenizer = get_tokenizer("basic_english")
    with torch.no_grad():
        text = torch.tensor([vocab[token]
                             for token in ngrams_iterator(tokenizer(text), ngrams)])
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item() + 1

ex_text_str = "MEMPHIS, Tenn. - Four days ago, Jon Rahm was \
enduring the season's worst weather conditions on Sunday at The \
Open on his way to a closing 75 at Royal Portrush, which \
considering the wind and the rain was a respectable showing. \
Thursday's first round at the WGC-FedEx St. Jude Invitational \
was another story. With temperatures in the mid-80s and hardly any \
wind, the Spaniard was 13 strokes better in a flawless round. \
Thanks to his best putting performance on the PGA Tour, Rahm \
finished with an 8-under 62 for a three-stroke lead, which \
was even more impressive considering he'd never played the \
front nine at TPC Southwind."

vocab = train_dataset.get_vocab()
model = model.to("cpu")
print("This is a %s news" % ag_news_label[predict(ex_text_str, model, vocab, 2)])
# Output: This is a Sports news
We can calculate the similarity between two (or more) inputs, e.g.:

- A vs a
- cat vs cats
- co-chief vs -chief
- I love cats vs I love dogs

Two types of measures exist, returning one of two kinds of output:

1. True or False (binary), i.e. two inputs are exactly the same or they are not - nothing in between.
2. A degree of similarity, e.g. cat vs cats: 3 out of 4 (75%) characters are the same, 1 out of 4 (25%) is different.

It only makes sense to use type 1 metrics when the inputs are always of equal length. For example, using any type 1 metric to compare cat and cats always results in "not equal". On the other hand, type 2 metrics are applicable to both equal and non-equal length inputs.
> Overview of semantic similarity
Similarity metrics are mostly calculated on vector representations of words, sentences or documents.

The following metrics assume two inputs of equal length. If two inputs don't have the same length, they can be normalized by padding them to equal length with a padding character.
Hamming distance is measured between two strings of equal length and defined as the number of positions that have different characters.
Manhattan distance (L1 norm, city block distance, taxicab distance) counts the mismatches by summing the absolute differences between each pair of characters at each position of two strings.
Euclidean distance (L2 norm) is defined as the shortest straight-line distance between two points.
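A minimal sketch of these three equal-length metrics; it assumes scipy is available and uses an underscore as the padding character (both assumptions for illustration):

```python
from scipy.spatial import distance

a, b = "cat", "cats"
length = max(len(a), len(b))
a, b = a.ljust(length, '_'), b.ljust(length, '_')  # pad to equal length

u = [ord(c) for c in a]  # map characters to character codes
v = [ord(c) for c in b]

print(distance.hamming(u, v))    # fraction of positions that differ
print(distance.cityblock(u, v))  # Manhattan / L1
print(distance.euclidean(u, v))  # Euclidean / L2
```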
The following metrics work with inputs of equal and non-equal length.
Cosine distance (L2-normalized dot product of vectors) measures the similarity by using the normalized length of two input vectors. The order of characters/words is not taken into account.
Jaccard similarity indicates how many words two documents share by using the intersection and unions of the words.
Levenshtein distance measures the minimum number of edits needed to transform one input into the other. Considers order of characters or words in input.
Jaro-Winkler distance: a minimum edit distance that gives favourable ratings to strings sharing a common prefix.
Okapi BM25(F) ranking takes token distributions across the corpus into account. A simplified scoring sketch:

q_idf * dot(q_tf, d_tf[i]) * 1.5 / (dot(q_tf, d_tf[i]) + .25 + .75 * d_num_words[i] / d_num_words.mean())
A document, sentence or word is represented as a vector, and the cosine similarity calculates the angle (= similarity) between two vectors.

The resulting similarity ranges from -1 (exactly opposite) to 1 (exactly the same), with 0 indicating orthogonality (no correlation).
Similarity measurements for:
The figure below shows three word vectors and the cosine distance (= similarity) between them:
doc_1 = "Data is the oil of the digital economy"
doc_2 = "Data is a new oil"
# Vector representation of the document
doc_1_vector = [1, 1, 1, 1, 0, 1, 1, 2]
doc_2_vector = [1, 0, 0, 1, 1, 0, 1, 0]
Cosine sim with scikit: sklearn.metrics.pairwise.cosine_similarity
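Applied to the two document vectors above (the expected similarity is roughly 0.47):

```python
from sklearn.metrics.pairwise import cosine_similarity

doc_1_vector = [1, 1, 1, 1, 0, 1, 1, 2]
doc_2_vector = [1, 0, 0, 1, 1, 0, 1, 0]

print(cosine_similarity([doc_1_vector], [doc_2_vector]))  # [[0.474...]]
```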
Jaccard similarity is calculated by dividing the number of words occurring in both documents/sentences (intersection) by the number of all unique words in both docs/sents (union).
doc_1 = "Data is the new oil of the digital economy"
doc_2 = "Data is a new oil"
Each sentence is tokenized into words.
words_doc1 = {'data', 'is', 'the', 'new', 'oil', 'of', 'digital', 'economy'}
words_doc2 = {'data', 'is', 'a', 'new', 'oil'}
Four words occur in both sentences (intersection): data, is, new, oil.
Across both sentences, nine unique words exist (union): data, a, of, is, economy, the, new, digital, oil.
The Jaccard similarity is therefore 4 / 9 ≈ 0.44.
Visualized:
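A minimal sketch using Python sets:

```python
def jaccard_similarity(doc_1, doc_2):
    words_1 = set(doc_1.lower().split())
    words_2 = set(doc_2.lower().split())
    # intersection over union
    return len(words_1 & words_2) / len(words_1 | words_2)

print(jaccard_similarity("Data is the new oil of the digital economy",
                         "Data is a new oil"))  # 4/9 = 0.444...
```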
Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
The distance can also be calculated for sentences, i.e. the minimum number of single token edits required to change one sentence into the other.
Levenshtein distance for two strings a, b of length |a| and |b| is given by lev(a, b):

$$
\operatorname{lev}_{a,b}(i,j) =
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0, \\
\min
\begin{cases}
\operatorname{lev}_{a,b}(i-1,j)+1 \\
\operatorname{lev}_{a,b}(i,j-1)+1 \\
\operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i \neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
$$
The Levenshtein distance between HONDA and HYUNDAI is 3, because it takes three transformations to change HONDA into HYUNDAI:

1. Insert Y into HONDA => HYONDA
2. Substitute O with U in HYONDA => HYUNDA
3. Append I to HYUNDA => HYUNDAI
The Levenshtein distance between the two sentences I love cats and I love dogs is 1:

Step 1: tokenize both sentences into ["I", "love", "cats"] and ["I", "love", "dogs"].
Step 2: perform one transformation, i.e. replace cats with dogs.
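A minimal dynamic-programming sketch that works on both strings of characters and lists of tokens:

```python
def levenshtein(a, b):
    # dp[i][j] = edits needed to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

print(levenshtein("HONDA", "HYUNDAI"))                            # 3
print(levenshtein("I love cats".split(), "I love dogs".split()))  # 1
```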
Length of a word vector. Also known as Euclidean norm.
Example:
doc1 = nlp("I like cats")
doc2 = nlp("I like dogs")
# Compare 2 documents
doc1.similarity(doc2)
0.957709143352323
# "cats" vs "dogs"
doc1[2].similarity(doc2[2])
0.83117634
# "I" vs "like dogs"
doc1[0].similarity(doc2[1:3])
0.46475163
doc = nlp("I like cats")
# L2 norm of "I like cats"
doc.vector_norm
4.706799587675896
# L2 norm of "cats"
doc[2].vector_norm
6.933004
# Vector representation of "cats"
doc[2].vector
array([-0.26763 , 0.029846 , -0.3437 , -0.54409 , -0.49919 ,
0.15928 , -0.35278 , -0.2036 , 0.23482 , 1.5671 ,
-0.36458 , -0.028713 , -0.27053 , 0.2504 , -0.18126 ,
0.13453 , 0.25795 , 0.93213 , -0.12841 , -0.18505 ,
-0.57597 , 0.18538 , -0.19147 , -0.38465 , 0.21656 ,
-0.4387 , -0.27846 , -0.41339 , 0.37859 , -0.2199 ,
-0.25907 , -0.019796 , -0.31885 , 0.12921 , 0.22168 ,
0.32671 , 0.46943 , -0.81922 , -0.20031 , 0.013561 ,
-0.14663 , 0.14438 , 0.0098044, -0.15439 , 0.21146 ,
-0.28409 , -0.4036 , 0.45355 , 0.12173 , -0.11516 ,
-0.12235 , -0.096467 , -0.26991 , 0.028776 , -0.11307 ,
0.37219 , -0.054718 , -0.20297 , -0.23974 , 0.86271 ,
0.25602 , -0.3064 , 0.014714 , -0.086497 , -0.079054 ,
-0.33109 , 0.54892 , 0.20076 , 0.28064 , 0.037788 ,
0.0076729, -0.0050123, -0.11619 , -0.23804 , 0.33027 ,
0.26034 , -0.20615 , -0.35744 , 0.54125 , -0.3239 ,
0.093441 , 0.17113 , -0.41533 , 0.13702 , -0.21765 ,
-0.65442 , 0.75733 , 0.359 , 0.62492 , 0.019685 ,
0.21156 , 0.28125 , 0.22288 , 0.026787 , -0.1019 ,
0.11178 , 0.17202 , -0.20403 , -0.01767 , -0.34351 ,
0.11926 , 0.73156 , 0.11094 , 0.12576 , 0.64825 ,
-0.80004 , 0.62074 , -0.38557 , 0.015614 , 0.2664 ,
0.18254 , 0.11678 , 0.58919 , -1.0639 , -0.29969 ,
0.14827 , -0.42925 , -0.090766 , 0.12313 , -0.024253 ,
-0.21265 , -0.10331 , 0.91988 , -1.4097 , -0.0542 ,
-0.071201 , 0.66878 , -0.24651 , -0.46788 , -0.23991 ,
-0.14138 , -0.038911 , -0.48678 , 0.22975 , 0.36074 ,
0.13024 , -0.40091 , 0.19673 , 0.016017 , 0.30575 ,
-2.1901 , -0.55468 , 0.26955 , 0.63815 , 0.42724 ,
-0.070186 , -0.11196 , 0.14079 , -0.022228 , 0.070456 ,
0.17229 , 0.099383 , -0.12258 , -0.23416 , -0.26525 ,
-0.088991 , -0.061554 , 0.26582 , -0.53112 , -0.4106 ,
0.45211 , -0.39669 , -0.43746 , -0.6632 , -0.048135 ,
0.23171 , -0.37665 , -0.38261 , -0.29286 , -0.036613 ,
0.25354 , 0.49775 , 0.3359 , -0.11285 , -0.17228 ,
0.85991 , -0.34081 , 0.27959 , 0.03698 , 0.61782 ,
0.23739 , -0.32049 , -0.073717 , 0.015991 , -0.37395 ,
-0.4152 , 0.049221 , -0.3137 , 0.091128 , -0.38258 ,
-0.036783 , 0.10902 , -0.38332 , -0.74754 , 0.016473 ,
0.55256 , -0.29053 , -0.50617 , 0.83599 , -0.31783 ,
-0.77465 , -0.0049272, -0.17103 , -0.38067 , 0.44987 ,
-0.12497 , 0.60263 , -0.12026 , 0.37368 , -0.079952 ,
-0.15785 , 0.37684 , -0.18679 , 0.18855 , -0.4759 ,
-0.11708 , 0.36999 , 0.54134 , 0.42752 , 0.038618 ,
0.043483 , 0.31435 , -0.24491 , -0.67818 , -0.33833 ,
0.039218 , -0.11964 , 0.8474 , 0.09451 , 0.070523 ,
-0.2806 , 0.296 , -0.17554 , -0.41087 , 0.70748 ,
0.17686 , 0.043479 , -0.31902 , 0.64584 , -0.45268 ,
-0.7967 , 0.099817 , -0.1734 , 0.11404 , -0.36809 ,
0.12035 , -0.048582 , 0.55945 , -0.51508 , 0.072704 ,
0.18106 , 0.07802 , -0.31526 , 0.38189 , 0.092801 ,
-0.044227 , -0.66154 , -0.020428 , 0.059836 , -0.23628 ,
-0.017592 , -0.56481 , -0.52934 , -0.16392 , 0.077331 ,
0.24583 , -0.32195 , -0.36811 , -0.037208 , 0.26702 ,
-0.57907 , 0.46457 , -0.54636 , 0.11855 , 0.092475 ,
-0.10469 , 0.03319 , 0.62616 , -0.33684 , 0.045742 ,
0.25089 , 0.28973 , 0.060633 , -0.4096 , 0.39198 ,
0.58276 , 0.496 , -0.75881 , 0.13655 , 0.21704 ,
-0.37978 , -0.54051 , -0.22813 , 0.28393 , -0.58739 ,
1.0472 , -0.13318 , -0.07325 , 0.12991 , -0.44999 ],
dtype=float32)
# can also be done using sklearn's linear kernel (equivalent to cosine similarity)
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
def get_top_n_words(corpus, n=None):
    vec = CountVectorizer(stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
document_corpus = [
"Dog bites man",
"Dog bites man after man eats fish",
"Dog bites fish",
"Man bites dog",
"Dog eats meat",
"Man eats food",
"Man eats fish"
]
common_words = get_top_n_words(document_corpus, 5) # or use df['desc']
df2 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
df2.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='Top 5 words in document corpus')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbae1ff3510>
def get_top_n_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
document_corpus = [
"Dog bites man",
"Dog bites man after man eats fish",
"Dog bites fish",
"Man bites dog",
"Dog eats meat",
"Man eats food",
"Man eats fish"
]
common_words = get_top_n_bigram(document_corpus, 5)
df4 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
df4.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='Top 5 bigrams in our corpus after removing stop words')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbae2086950>
def get_top_n_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
document_corpus = [
"Dog bites man",
"Dog bites man after man eats fish",
"Dog bites fish",
"Man bites dog",
"Dog eats meat",
"Man eats food",
"Man eats fish"
]
common_words = get_top_n_trigram(document_corpus, 5)
df6 = pd.DataFrame(common_words, columns = ['desc' , 'count'])
df6.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='Top 5 trigrams in our corpus after removing stop words')
<matplotlib.axes._subplots.AxesSubplot at 0x7fbae20fb150>
from spacy import displacy
doc = nlp("This is a sentence")
displacy.render(doc, style="dep")
doc = nlp("Larry Page founded Google in the US in early 1990.")
displacy.render(doc, style="ent")
Inspired by: https://www.datacamp.com/community/blog/spacy-cheatsheet
Used by
Useful for
Also known as
Opposite of kernels: vectors
Source:
A linear kernel is used when the data is linearly separable, that is, it can be separated using a single line.
Compute the linear kernel between X and Y: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.linear_kernel.html#sklearn.metrics.pairwise.linear_kernel
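A minimal sketch; on the L2-normalized vectors produced by TfidfVectorizer, linear_kernel gives the same result as cosine_similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

docs = ["Data is the oil of the digital economy", "Data is a new oil"]
tfidf = TfidfVectorizer().fit_transform(docs)

# similarity of document 0 to all documents in the corpus
print(linear_kernel(tfidf[0:1], tfidf).flatten())
```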
Credits: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
k-nearest neighbors algorithm
Useful for
Is the text fact or opinion? Only perform sentiment analysis on opinions, not facts.
Sentiments:
2 ways:
Sentiment analysis can be performed with NLTK's SentimentIntensityAnalyzer.
See: https://www.nltk.org/api/nltk.sentiment.html#module-nltk.sentiment.vader
Learning resources:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
# nltk.download('vader_lexicon')
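A minimal usage sketch (the example sentence is made up; remember to download the vader_lexicon first):

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The movie was great, but the ending felt rushed."))
# returns a dict with 'neg', 'neu', 'pos' and a normalized 'compound' score
```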
A classification model that uses a sigmoid function to convert a linear model's raw prediction (y′) into a value between 0 and 1. You can interpret the value between 0 and 1 in either of the following two ways:
https://developers.google.com/machine-learning/glossary/#logistic-regression
Recurrent neural networks
Long Short-Term Memory
ToDo
from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation, Dropout, RepeatVector
# note: Merge and TimeDistributedDense were removed in Keras 2;
# use the functional API and the TimeDistributed(Dense(...)) wrapper instead
import Levenshtein  # pip install python-Levenshtein
Credits: https://towardsdatascience.com/how-to-train-custom-word-embeddings-using-gpu-on-aws-f62727a1e3f6
A measure of how far a model's predictions are from its label.
In contrast to:
Mean Squared Error (MSE) is a common loss function used for regression problems.
Mean squared error of an estimator measures the average of the squares of the errors, that is, the average squared difference between the estimated values and the actual value.
Can be used for regression problems (say, to predict the price of a house).
Alternatives:
Used in binary classification tasks, i.e. when the model outputs a probability (a single-unit layer with a sigmoid activation); we'll use the binary_crossentropy loss function.
Used in image classification tasks
Used in logistic regression tasks
This is how the model is updated based on the data it sees and its loss function.
Optimization algorithm for finding the minimum of a function.
A function (for example, ReLU or sigmoid) that takes in the weighted sum of all of the inputs from the previous layer and then generates and passes an output value (typically nonlinear) to the next layer.
https://developers.google.com/machine-learning/glossary/#activation_function
A function that provides probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, softmax might determine the probability of a particular image being a dog at 0.9, a cat at 0.08, and a horse at 0.02.
Example: the last layer is a 10-node softmax layer; this returns an array of 10 probability scores that sum to 1.
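A minimal numpy sketch of softmax (the names are illustrative):

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    exps = np.exp(logits - logits.max())  # subtract the max for numerical stability
    return exps / exps.sum()

print(softmax([2.0, 1.0, 0.1]))  # approximately [0.66 0.24 0.10]; sums to 1.0
```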
A function that maps logistic or multinomial regression output (log odds) to probabilities, returning a value between 0 and 1.
The sigmoid function converts σ (the log-odds) into a probability between 0 and 1.
Used when training a neural network.
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1 score: (2 × Precision × Recall) / (Precision + Recall)
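A minimal scikit-learn sketch computing all three (the labels are made up):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```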
A common regression metric is Mean Absolute Error (MAE).
Early stopping is a useful technique to prevent overfitting.
penalizes weights in proportion to the sum of the absolute values of the weights
https://developers.google.com/machine-learning/glossary/#L1_regularization
penalizes weights in proportion to the sum of the squares of the weights
The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix.
Used by Reddit to rank comments.
https://spacy.io/models/en#en_pytt_xlnetbasecased_lg
A confusion matrix is a table where each cell [i, j] indicates how often label j was predicted when the correct label was i.
Information gain measures how much more organized the input values become when we divide them up using a given feature. To measure how disorganized the original set of input values is, we calculate the entropy of their labels, which will be high if the input values have highly varied labels, and low if many input values all have the same label.
Entropy is defined as the sum of the probability of each label times the log probability of that same label:

H = -Σ P(l) × log2(P(l))

Labels that have low frequency do not contribute much to the entropy (since P(l) is small), and labels with high frequency also do not contribute much to the entropy (since log2(P(l)) is small).
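A minimal sketch of this entropy calculation, following the NLTK book:

```python
import math
import nltk

def entropy(labels):
    freqdist = nltk.FreqDist(labels)
    probs = [freqdist.freq(l) for l in freqdist]
    return -sum(p * math.log(p, 2) for p in probs)

print(entropy(['male', 'male', 'male', 'male']))      # 0.0 (perfectly organized)
print(entropy(['male', 'female', 'male', 'female']))  # 1.0 (highly varied)
```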
Credits: Natural Language Processing with Python book
Credits: https://www.nltk.org/book/ch06.html