# CRF to detect named entities (primarily names of people)
This is an implementation using (linear chain) conditional random fields (CRF) in Python 2.7 for named entity recognition (NER). It uses the python-crfsuite library as its basis. By default it can handle the labels `PER`, `LOC`, `ORG` and `MISC`, but was primarily optimized for `PER` (recognition of names of people) in German, though it should be usable for any language. Scores are expected to be a bit lower for labels other than `PER`, because the Gazetteer feature currently only handles `PER` labels. The implementation achieved an F1 score for `PER` of 0.78 on the Germeval 2014 NER corpus (notice that German NER is significantly harder than English NER) and an F1 score of 0.87 (again for `PER`) on an automatically annotated Wikipedia corpus (it was trained on an excerpt of that Wikipedia corpus, so a higher score was expected, as the Germeval 2014 NER corpus partly differs from Wikipedia's style of language).
The CRF implementation uses only local features (i.e. annotating `John` at the top of a document with `PER` has no influence on another `John` at the bottom of the same document).
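To illustrate what "local" means here, a minimal sketch of per-token feature dicts in the style python-crfsuite expects — each token only sees a small window around itself (the function and feature names in this snippet are illustrative, not the repository's actual code):

```python
def token_features(tokens, i):
    """Build a local feature dict for the token at index i,
    looking only at a small window around it (no document-level state)."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
    }
    # Neighboring tokens: still "local", i.e. limited to the window.
    if i > 0:
        features["-1:word.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence/window
    if i < len(tokens) - 1:
        features["+1:word.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence/window
    return features
```

Because the window never spans the whole document, two occurrences of the same name cannot influence each other's labels.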
The used features include:

* whether the word contains punctuation, i.e. any of `. , : ; ( ) [ ] ? !`
* the word2vec cluster of the word (generated via the `-classes` flag to the word2vec tool)
* binary properties of the word, encoded as `1` and `0`
* whether the word appears in a Gazetteer of person names (`PER`) that appear more often among the person names than among all words
* the word pattern, e.g. `John` becomes `Aa+`, `DARPA` becomes `A+`
* the unigram rank of the word among the 1000 most common words (words outside the rank of 1000 just get a `-1`)
* the prefix of the word, e.g. `John` becomes `Joh`
* the suffix of the word, e.g. `John` becomes `ohn`
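A rough sketch of how a few of these features can be computed (function names and the exact rank handling are illustrative, not the repository's actual code):

```python
import re

def word_pattern(word):
    """Compress a word into its character-class pattern,
    e.g. 'John' -> 'Aa+', 'DARPA' -> 'A+'."""
    mapped = []
    for ch in word:
        if ch.isupper():
            mapped.append("A")
        elif ch.islower():
            mapped.append("a")
        elif ch.isdigit():
            mapped.append("9")
        else:
            mapped.append(ch)
    # Collapse runs of the same character class into 'x+'.
    return re.sub(r"(.)\1+", r"\1+", "".join(mapped))

def prefix3(word):
    return word[:3]   # 'John' -> 'Joh'

def suffix3(word):
    return word[-3:]  # 'John' -> 'ohn'

def unigram_rank(word, ranked_unigrams, max_rank=1000):
    """Rank of the word among the most common words, -1 if outside."""
    try:
        rank = ranked_unigrams.index(word)
    except ValueError:
        return -1
    return rank if rank < max_rank else -1
```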
A large annotated corpus is required that (a) contains one article/document per line, (b) is tokenized (e.g. by the Stanford parser) and (c) contains annotated named entities of the form `word/LABEL`.
Example (each article shortened, German):

```
Ang/PER Lee/PER ( $foreign_language ; * 23 . Oktober 1954 in Pingtung/LOC , Taiwan/LOC ) ist ein US-amerikanisch-taiwanischer Filmregisseur , Drehbuchautor und Produzent . Er ist ...
Actinium ( latinisiert von griechisch ακτίνα , aktína „ Strahl “ ) ist ein radioaktives chemisches Element mit dem Elementsymbol Ac und der Ordnungszahl 89 . Das Element ...
Anschluss ist in der Soziologie ein Fachbegriff aus der Systemtheorie von Niklas/PER Luhmann/PER und bezeichnet die in einer sozialen Begegnung auf eine Selektion der ...
```
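Such a line can be split into (token, label) pairs by splitting each token on its last `/`; tokens without a known label get the `O` (outside) label. A hypothetical sketch, not the repository's parsing code:

```python
KNOWN_LABELS = {"PER", "LOC", "ORG", "MISC"}  # the default labels from above

def parse_line(line):
    """Split one corpus line into (token, label) pairs.
    Unannotated tokens get the 'O' (outside) label."""
    pairs = []
    for token in line.split():
        word, sep, label = token.rpartition("/")
        if sep and label in KNOWN_LABELS:
            pairs.append((word, label))
        else:
            pairs.append((token, "O"))
    return pairs
```

For example, `parse_line("Ang/PER Lee/PER ist")` yields `[("Ang", "PER"), ("Lee", "PER"), ("ist", "O")]`.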
Notice the `/PER` and `/LOC` labels. BIO codes will automatically be normalized to non-BIO codes (e.g. `B-PER` becomes `PER` and `I-LOC` becomes `LOC`).
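The described normalization amounts to stripping the BIO prefix (an illustrative sketch, not the repository's actual function):

```python
def normalize_bio(label):
    """Strip BIO prefixes, e.g. 'B-PER' -> 'PER', 'I-LOC' -> 'LOC'.
    Labels without a BIO prefix are returned unchanged."""
    if label[:2] in ("B-", "I-"):
        return label[2:]
    return label
```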
You will also need word2vec clusters (can come from that corpus or a different one) and brown clusters (same).
Note: You can create a large corpus with annotated names of people from Wikipedia, as names in Wikipedia articles are often linked to articles about people, which are identifiable as such. There are some papers about that.
To train and test a model:

1. Collect a corpus annotated with the labels `PER`, `LOC`, `ORG`, `MISC` as described above at *Corpus*. You can change these labels in `config.py`, but `PER` is required.
2. Use the `-classes` flag of the word2vec tool to generate word2vec clusters instead of vectors. This should result in one file.
3. Generate brown clusters for your corpus, resulting in a `paths` file.
4. Adjust `config.py` to match your settings. You will have to change:
   * `ARTICLES_FILEPATH` (path to your corpus file),
   * `STANFORD_DIR` (root directory of the Stanford POS tagger),
   * `STANFORD_POS_JAR_FILEPATH` (Stanford POS tagger jar filepath, might be different for your version),
   * `STANFORD_MODEL_FILEPATH` (POS tagging model to use, default is `german-fast`),
   * `W2V_CLUSTERS_FILEPATH` (filepath to your word2vec clusters),
   * `BROWN_CLUSTERS_FILEPATH` (filepath to your brown clusters `paths` file),
   * `COUNT_WINDOWS_TRAIN` (number of examples to train on, might be too many for your corpus),
   * `COUNT_WINDOWS_TEST` (number of examples to test on, might be too many for your corpus),
   * `LABELS` (if you don't use PER, LOC, ORG, MISC as labels; PER though is a requirement).
5. Run `python -m preprocessing.collect_unigrams` to create lists of unigrams for your corpus. This will take 2 hours or so, especially if your corpus is large.
6. Run `python -m preprocessing.lda --dict --train` to train the LDA model. This will take 2 hours or so, especially if your corpus is large.
7. Run `python train.py --identifier="my_experiment"` to train a CRF model with the name `my_experiment`. This will likely run for several hours (it did when tested on 20,000 example windows). Notice that feature generation will be very slow on the first run, as POS tagging and (to a lesser degree) LDA tagging take a lot of time.
8. Run `python test.py --identifier="my_experiment" --mycorpus` to test your trained CRF model on an excerpt of your corpus (by default on windows 0 to 4,000, while training happens on windows 4,000 to 24,000). This also requires feature generation and will therefore be slow on the first run.

Results on the Germeval 2014 NER corpus:
|             | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| O           | 0.97      | 1.00   | 0.98     | 23487   |
| PER         | 0.84      | 0.73   | 0.78     | 525     |
| avg / total | 0.95      | 0.96   | 0.95     | 25002   |
Note: ~1000 tokens are missing, because they belonged to LOC, ORG or MISC. The CRF model was not really trained on these labels and therefore performed poorly. It was only properly trained on PER.
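The f1-score column is simply the harmonic mean of precision and recall; for the `PER` row above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# PER row of the table above: precision 0.84, recall 0.73
print(round(f1(0.84, 0.73), 2))  # -> 0.78
```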
Results on an automatically annotated Wikipedia corpus (therefore some PER labels might have been wrong/missing):
|             | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| O           | 0.97      | 0.98   | 0.98     | 182952  |
| PER         | 0.88      | 0.85   | 0.87     | 8854    |
| avg / total | 0.95      | 0.95   | 0.95     | 199239  |
Note: Same as above, the LOC, ORG and MISC rows were removed from the table.
License: MIT