In this tutorial, we implement a Bi-encoder based entity disambiguation system using the BC5CDR dataset and data from the MeSH knowledge base.
We compare surface-form based candidate generation with Bi-encoder based candidate generation to understand the strength of the Bi-encoder model in entity linking.
See here.
# If you don't use Colab Pro, create an environment with conda.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt
First, download the preprocessed files from here, then unzip them.
Second, download the BC5CDR dataset to `./dataset/` and unzip it.
You have to place `CDR_DevelopmentSet.PubTator.txt`, `CDR_TestSet.PubTator.txt` and `CDR_TrainingSet.PubTator.txt` under `./dataset/`.
Then, run `python3 BC5CDRpreprocess.py` and `python3 preprocess_mesh.py`.
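For orientation, the sketch below shows what a PubTator-style file contains and how it could be read. It is not the code in `BC5CDRpreprocess.py`: the names `Mention`, `Document`, `parse_pubtator` and `SAMPLE` are hypothetical, and the sample record is a hand-written toy, not taken from BC5CDR. It assumes the usual PubTator layout: `PMID|t|title` and `PMID|a|abstract` lines, followed by tab-separated mention lines (PMID, start, end, text, type, ID).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Mention:
    start: int
    end: int
    text: str
    entity_type: str
    mesh_id: str


@dataclass
class Document:
    pmid: str
    title: str = ""
    abstract: str = ""
    mentions: List[Mention] = field(default_factory=list)


def parse_pubtator(raw: str) -> Dict[str, Document]:
    """Parse PubTator-formatted text into documents keyed by PMID."""
    docs: Dict[str, Document] = {}
    for line in raw.splitlines():
        if not line.strip():
            continue  # blank lines separate documents
        # Title / abstract lines look like "PMID|t|text" or "PMID|a|text".
        head = line.split("|", 2)
        if len(head) == 3 and head[1] in ("t", "a"):
            doc = docs.setdefault(head[0], Document(pmid=head[0]))
            if head[1] == "t":
                doc.title = head[2]
            else:
                doc.abstract = head[2]
            continue
        # Mention lines have at least six tab-separated columns with numeric
        # offsets. Relation lines (fewer columns) are skipped in this sketch.
        cols = line.split("\t")
        if len(cols) >= 6 and cols[1].isdigit() and cols[2].isdigit():
            doc = docs.setdefault(cols[0], Document(pmid=cols[0]))
            doc.mentions.append(
                Mention(int(cols[1]), int(cols[2]), cols[3], cols[4], cols[5])
            )
    return docs


# Hand-written toy record (illustrative only, not from the dataset).
SAMPLE = (
    "1|t|Aspirin induced asthma.\n"
    "1|a|We report a case of asthma after aspirin.\n"
    "1\t0\t7\tAspirin\tChemical\tD001241\n"
    "1\t16\t22\tasthma\tDisease\tD001249\n"
    "1\tCID\tD001241\tD001249\n"
)

docs = parse_pubtator(SAMPLE)
for m in docs["1"].mentions:
    print(m.text, m.entity_type, m.mesh_id)
```

Offsets in this sketch are treated as character offsets into the title followed by the abstract; check that assumption against the actual files before relying on it.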
Surface-Candidate based
ANN-search based
Default: Dot product between mention and predicted entity.
L2-distance and cosine similarity are also supported.
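The three scoring options can rank the same entities differently. The sketch below is a dependency-free illustration of that point, not the repository's scoring code: `score` and `top_k` are hypothetical helpers, the vectors are toy values, and the method names simply reuse the `search_method_for_faiss` values listed in this README.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def score(mention: Sequence[float], entity: Sequence[float],
          method: str = "indexflatip") -> float:
    """Return a similarity score where higher always means a better match."""
    if method == "indexflatip":
        return dot(mention, entity)
    if method == "indexflatl2":
        # Negated squared L2 distance, so that closer entities score higher.
        return -sum((a - b) ** 2 for a, b in zip(mention, entity))
    if method == "cossim":
        norm = math.sqrt(dot(mention, mention)) * math.sqrt(dot(entity, entity))
        return dot(mention, entity) / norm if norm > 0 else 0.0
    raise ValueError(f"unknown method: {method}")


def top_k(mention: Sequence[float], entities: Sequence[Sequence[float]],
          k: int, method: str = "indexflatip") -> List[int]:
    """Exhaustively score every entity and return the indices of the best k."""
    scores = [score(mention, e, method) for e in entities]
    return sorted(range(len(entities)), key=lambda i: scores[i], reverse=True)[:k]


# Toy 2-d embeddings (illustrative values only).
entities = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
mention = [1.0, 1.0]

for method in ("indexflatip", "indexflatl2", "cossim"):
    print(method, top_k(mention, entities, k=1, method=method))
```

The inner product rewards entities with large norms, while L2 distance prefers entities that are close in absolute position, so the choice of metric should match how the encoder was trained.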
$ rm -r serialization_dir # Remove results of a previous run if you ran `python3 main.py -debug` for debugging.
$ python3 main.py
Here we note only the parameters critical for training and evaluation. For further details, see `parameters.py`.
Parameter Name | Description | Default |
---|---|---|
`batch_size_for_train` | Batch size during training. A larger batch gives the encoder more negative examples from which it must learn to pick the correct answer. | 16 |
`lr` | Learning rate. | 1e-5 |
`max_candidates_num` | Number of candidates generated for each mention from its surface form. | 5 |
`search_method_for_faiss` | Metric used for approximate nearest neighbor search: cosine similarity (`cossim`), inner product (`indexflatip`), or L2 distance (`indexflatl2`). | `indexflatip` |
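The description of `batch_size_for_train` suggests that the other examples in a batch act as negatives. A minimal sketch of such an in-batch negative objective follows. This is an assumption about the training setup, not a copy of the repository's loss, and `in_batch_negative_loss` is a hypothetical name. Each mention is scored against every entity in the batch by dot product, and the entity at the same index is treated as the correct answer.

```python
import math
from typing import Sequence


def in_batch_negative_loss(mention_vecs: Sequence[Sequence[float]],
                           entity_vecs: Sequence[Sequence[float]]) -> float:
    """Mean softmax cross-entropy where row i's gold entity is entity i.

    All other entities in the batch serve as negatives, so each mention
    is contrasted with (batch_size - 1) negative examples.
    """
    if len(mention_vecs) != len(entity_vecs):
        raise ValueError("mentions and entities must be paired one-to-one")
    losses = []
    for i, m in enumerate(mention_vecs):
        logits = [sum(a * b for a, b in zip(m, e)) for e in entity_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


# With uninformative (all-zero) embeddings the loss equals log(batch_size):
# the task gets harder as the batch, and thus the number of negatives, grows.
for batch_size in (1, 16, 48):
    zeros = [[0.0, 0.0]] * batch_size
    print(batch_size, in_batch_negative_loss(zeros, zeros))
```

Under this assumption, moving from a batch of 16 to a batch of 48 raises the number of negatives per mention from 15 to 47.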
Surface-Candidate based recall
Generated Candidates Num | 5 | 10 | 20 |
---|---|---|---|
dev_recall | 76.80 | 79.91 | 80.92 |
test_recall | 74.35 | 77.14 | 78.25 |
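To make the recall figures above concrete, here is a rough sketch of surface-form candidate generation and how its recall could be measured. It uses string similarity from Python's standard `difflib` module as a stand-in; the repository's actual matching method may differ. `generate_candidates`, `candidate_recall` and `NAME_TO_ID` are hypothetical names, and the entity IDs are made-up placeholders.

```python
import difflib
from typing import Dict, List, Sequence

# Toy dictionary from lowercased entity names to placeholder IDs.
NAME_TO_ID: Dict[str, str] = {
    "aspirin": "E1",
    "acetylsalicylic acid": "E1",
    "asthma": "E2",
    "hypertension": "E3",
}


def generate_candidates(mention: str, name_to_id: Dict[str, str],
                        max_candidates_num: int = 5) -> List[str]:
    """Return up to max_candidates_num entity IDs whose names resemble the mention."""
    names = difflib.get_close_matches(
        mention.lower(), list(name_to_id), n=max_candidates_num, cutoff=0.0
    )
    candidates: List[str] = []
    for name in names:
        entity_id = name_to_id[name]
        if entity_id not in candidates:  # synonyms may map to the same entity
            candidates.append(entity_id)
    return candidates


def candidate_recall(mentions: Sequence[str], gold_ids: Sequence[str],
                     name_to_id: Dict[str, str], max_candidates_num: int) -> float:
    """Percentage of mentions whose gold ID appears among the generated candidates."""
    hits = sum(
        gold in generate_candidates(m, name_to_id, max_candidates_num)
        for m, gold in zip(mentions, gold_ids)
    )
    return 100.0 * hits / len(mentions)


print(generate_candidates("Asprin", NAME_TO_ID, max_candidates_num=1))
print(candidate_recall(["aspirin", "asthma"], ["E1", "E2"], NAME_TO_ID, 1))
```

Mentions that share little surface overlap with any entity name cannot be recovered this way no matter how many candidates are generated, which is the limitation the Bi-encoder approach is meant to address.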
batch_size_for_train: 16
Surface-Candidate based acc.
Generated Candidates Num | 5 | 10 | 20 |
---|---|---|---|
dev_acc | 59.85 | 52.56 | 47.23 |
test_acc | 58.51 | 51.38 | 45.69 |
ANN-search Based
(Generated Candidates Num: 50 (Fixed))
Recall@X | 1 (Acc.) | 5 | 10 | 50 |
---|---|---|---|---|
dev_recall | 21.58 | 42.28 | 50.48 | 67.11 |
test_recall | 21.50 | 40.29 | 47.95 | 64.52 |
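Recall@X in the table above can be read as the percentage of mentions whose gold entity appears among the top X retrieved candidates, with Recall@1 equal to accuracy. A small sketch of that computation follows; `recall_at_k` is a hypothetical helper and the ranked lists are toy data, not output of the model.

```python
from typing import Dict, Sequence


def recall_at_k(ranked_candidates: Sequence[Sequence[str]],
                gold_ids: Sequence[str],
                ks: Sequence[int] = (1, 5, 10, 50)) -> Dict[int, float]:
    """Return {k: percentage of mentions whose gold ID is in the top k}."""
    if len(ranked_candidates) != len(gold_ids):
        raise ValueError("one ranked list is required per gold ID")
    result: Dict[int, float] = {}
    for k in ks:
        hits = sum(
            gold in ranked[:k]
            for ranked, gold in zip(ranked_candidates, gold_ids)
        )
        result[k] = 100.0 * hits / len(gold_ids)
    return result


# Toy example: three mentions that all link to entity "a".
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "a"]
print(recall_at_k(ranked, gold, ks=(1, 2, 3)))
```

Because the top-k list always contains the top-(k-1) list, recall can only stay the same or increase as X grows, which matches the pattern in each row.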
batch_size_for_train: 48
Surface-Candidate based acc.
Generated Candidates Num | 5 | 10 | 20 |
---|---|---|---|
dev_acc | 72.39 | 68.21 | 65.40 |
test_acc | 70.95 | 66.87 | 63.72 |
ANN-search Based
(Generated Candidates Num: 50 (Fixed))
Recall@X | 1 (Acc.) | 5 | 10 | 50 |
---|---|---|---|---|
dev_recall | 58.86 | 74.33 | 78.14 | 83.10 |
test_recall | 57.66 | 73.14 | 76.73 | 81.39 |
MIT