# CRF to detect named entities (primarily names of people)
This is an implementation using (linear chain) conditional random fields (CRF) in Python 2.7 for named entity recognition (NER). It uses the python-crfsuite library as its basis. By default it can handle the labels `PER`, `LOC`, `ORG` and `MISC`, but was primarily optimized for `PER` (recognition of names of people) in German, though it should be usable for any language. Scores are expected to be a bit lower for labels other than `PER`, because the Gazetteer feature currently only handles `PER` labels. The implementation achieved an F1 score for `PER` of 0.78 on the Germeval 2014 NER corpus (notice that German NER is significantly harder than English NER) and an F1 score of 0.87 (again for `PER`) on an automatically annotated Wikipedia corpus (it was trained on an excerpt of that Wikipedia corpus, so a higher score was expected, as the Germeval 2014 NER corpus partly differs from Wikipedia's style of language).
The CRF implementation uses only local features (i.e. annotating `John` at the top of a document with `PER` has no influence on another `John` at the bottom of the same document).
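To illustrate what "local" means here, a minimal sketch of per-token feature dicts in the style python-crfsuite expects — each token only sees a small window around itself (the function and feature names in this snippet are illustrative, not the repository's actual code):

```python
def token_features(tokens, i):
    """Build a local feature dict for the token at index i,
    looking only at a small window around it (no document-level state)."""
    word = tokens[i]
    features = {
        "word.lower": word.lower(),
        "word.isupper": word.isupper(),
        "word.istitle": word.istitle(),
        "word.isdigit": word.isdigit(),
    }
    # Neighboring tokens: still "local", i.e. limited to the window.
    if i > 0:
        features["-1:word.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence/window
    if i < len(tokens) - 1:
        features["+1:word.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence/window
    return features
```

Because the window never spans the whole document, two occurrences of the same name cannot influence each other's labels.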
The used features include:

* whether the word contains punctuation, i.e. any of `. , : ; ( ) [ ] ? !`
* the word2vec cluster of the word (generated via the `-classes` flag to the word2vec tool)
* binary properties of the word, encoded as `1` and `0`
* whether the word appears in a Gazetteer of person names (`PER`) that appear more often among the person names than among all words
* the word pattern, e.g. `John` becomes `Aa+`, `DARPA` becomes `A+`
* the unigram rank of the word among the 1000 most common words (words outside the rank of 1000 just get a `-1`)
* the prefix of the word, e.g. `John` becomes `Joh`
* the suffix of the word, e.g. `John` becomes `ohn`
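A rough sketch of how a few of these features can be computed (function names and the exact rank handling are illustrative, not the repository's actual code):

```python
import re

def word_pattern(word):
    """Compress a word into its character-class pattern,
    e.g. 'John' -> 'Aa+', 'DARPA' -> 'A+'."""
    mapped = []
    for ch in word:
        if ch.isupper():
            mapped.append("A")
        elif ch.islower():
            mapped.append("a")
        elif ch.isdigit():
            mapped.append("9")
        else:
            mapped.append(ch)
    # Collapse runs of the same character class into 'x+'.
    return re.sub(r"(.)\1+", r"\1+", "".join(mapped))

def prefix3(word):
    return word[:3]   # 'John' -> 'Joh'

def suffix3(word):
    return word[-3:]  # 'John' -> 'ohn'

def unigram_rank(word, ranked_unigrams, max_rank=1000):
    """Rank of the word among the most common words, -1 if outside."""
    try:
        rank = ranked_unigrams.index(word)
    except ValueError:
        return -1
    return rank if rank < max_rank else -1
```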
A large annotated corpus is required that (a) contains one article/document per line, (b) is tokenized (e.g. by the Stanford parser) and (c) contains annotated named entities of the form `word/LABEL`.
Example (each article shortened, German):

```
Ang/PER Lee/PER ( $foreign_language ; * 23 . Oktober 1954 in Pingtung/LOC , Taiwan/LOC ) ist ein US-amerikanisch-taiwanischer Filmregisseur , Drehbuchautor und Produzent . Er ist ...
Actinium ( latinisiert von griechisch ακτίνα , aktína „ Strahl “ ) ist ein radioaktives chemisches Element mit dem Elementsymbol Ac und der Ordnungszahl 89 . Das Element ...
Anschluss ist in der Soziologie ein Fachbegriff aus der Systemtheorie von Niklas/PER Luhmann/PER und bezeichnet die in einer sozialen Begegnung auf eine Selektion der ...
```
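Such a line can be split into (token, label) pairs by splitting each token on its last `/`; tokens without a known label get the `O` (outside) label. A hypothetical sketch, not the repository's parsing code:

```python
KNOWN_LABELS = {"PER", "LOC", "ORG", "MISC"}  # the default labels from above

def parse_line(line):
    """Split one corpus line into (token, label) pairs.
    Unannotated tokens get the 'O' (outside) label."""
    pairs = []
    for token in line.split():
        word, sep, label = token.rpartition("/")
        if sep and label in KNOWN_LABELS:
            pairs.append((word, label))
        else:
            pairs.append((token, "O"))
    return pairs
```

For example, `parse_line("Ang/PER Lee/PER ist")` yields `[("Ang", "PER"), ("Lee", "PER"), ("ist", "O")]`.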
Notice the `/PER` and `/LOC` labels. BIO codes will automatically be normalized to non-BIO codes (e.g. `B-PER` becomes `PER` and `I-LOC` becomes `LOC`).
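The described normalization amounts to stripping the BIO prefix (an illustrative sketch, not the repository's actual function):

```python
def normalize_bio(label):
    """Strip BIO prefixes, e.g. 'B-PER' -> 'PER', 'I-LOC' -> 'LOC'.
    Labels without a BIO prefix are returned unchanged."""
    if label[:2] in ("B-", "I-"):
        return label[2:]
    return label
```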
You will also need word2vec clusters (can come from that corpus or a different one) and brown clusters (same).
Note: You can create a large corpus with annotated names of people from Wikipedia, as names in Wikipedia articles are often linked to articles about people, which are identifiable as such. There are some papers about that.
To train and test a model:

1. Collect a corpus annotated with the labels `PER`, `LOC`, `ORG`, `MISC` as described above at *Corpus*. You can change these labels in `config.py`, but `PER` is required.
2. Use the `-classes` flag of the word2vec tool to generate word2vec clusters instead of vectors. This should result in one file.
3. Generate brown clusters for your corpus, resulting in a `paths` file.
4. Adjust `config.py` to match your settings. You will have to change:
   * `ARTICLES_FILEPATH` (path to your corpus file),
   * `STANFORD_DIR` (root directory of the Stanford POS tagger),
   * `STANFORD_POS_JAR_FILEPATH` (Stanford POS tagger jar filepath, might be different for your version),
   * `STANFORD_MODEL_FILEPATH` (POS tagging model to use, default is `german-fast`),
   * `W2V_CLUSTERS_FILEPATH` (filepath to your word2vec clusters),
   * `BROWN_CLUSTERS_FILEPATH` (filepath to your brown clusters `paths` file),
   * `COUNT_WINDOWS_TRAIN` (number of examples to train on, might be too many for your corpus),
   * `COUNT_WINDOWS_TEST` (number of examples to test on, might be too many for your corpus),
   * `LABELS` (if you don't use PER, LOC, ORG, MISC as labels; PER though is a requirement).
5. Run `python -m preprocessing.collect_unigrams` to create lists of unigrams for your corpus. This will take 2 hours or so, especially if your corpus is large.
6. Run `python -m preprocessing.lda --dict --train` to train the LDA model. This will take 2 hours or so, especially if your corpus is large.
7. Run `python train.py --identifier="my_experiment"` to train a CRF model with the name `my_experiment`. This will likely run for several hours (it did when tested on 20,000 example windows). Notice that feature generation will be very slow on the first run, as POS tagging and (to a lesser degree) LDA tagging take a lot of time.
8. Run `python test.py --identifier="my_experiment" --mycorpus` to test your trained CRF model on an excerpt of your corpus (by default on windows 0 to 4,000, while training happens on windows 4,000 to 24,000). This also requires feature generation and will therefore be slow on the first run.

Results on the Germeval 2014 NER corpus:
|             | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| O           | 0.97      | 1.00   | 0.98     | 23487   |
| PER         | 0.84      | 0.73   | 0.78     | 525     |
| avg / total | 0.95      | 0.96   | 0.95     | 25002   |
Note: ~1000 tokens are missing, because they belonged to LOC, ORG or MISC. The CRF model was not really trained on these labels and therefore performed poorly. It was only properly trained on PER.
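The f1-score column is simply the harmonic mean of precision and recall; for the `PER` row above:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# PER row of the table above: precision 0.84, recall 0.73
print(round(f1(0.84, 0.73), 2))  # -> 0.78
```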
Results on an automatically annotated Wikipedia corpus (therefore some PER labels might have been wrong/missing):
|             | precision | recall | f1-score | support |
|-------------|-----------|--------|----------|---------|
| O           | 0.97      | 0.98   | 0.98     | 182952  |
| PER         | 0.88      | 0.85   | 0.87     | 8854    |
| avg / total | 0.95      | 0.95   | 0.95     | 199239  |
Note: Same as above, the LOC, ORG and MISC rows were removed from the table.
License: MIT