Entity-Linking-Tutorial

In this tutorial, we implement a Bi-encoder based entity disambiguation system using the BC5CDR dataset and data from the MeSH knowledge base.
We compare surface-form based candidate generation with Bi-encoder based candidate generation to show what the Bi-encoder model contributes to entity linking.

Docs for English

https://izuna385.medium.com/building-bi-encoder-based-entity-linking-system-with-transformer-6c111d86500

Docs for Japanese

Tutorial with Colab-Pro.

Environment Setup

First, create a base environment with conda.

# If you don't use colab-pro, create environment from conda.
$ conda create -n allennlp python=3.7
$ conda activate allennlp
$ pip install -r requirements.txt

Preprocessing

First, download the preprocessed files from here, then unzip them.
Second, download the BC5CDR dataset to ./dataset/ and unzip it.
Place CDR_DevelopmentSet.PubTator.txt, CDR_TestSet.PubTator.txt and CDR_TrainingSet.PubTator.txt under ./dataset/.
Then run python3 BC5CDRpreprocess.py and python3 preprocess_mesh.py.
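
For orientation, the PubTator files hold one document per blank-line-separated block: a `PMID|t|title` line, a `PMID|a|abstract` line, then tab-separated mention lines. The sketch below shows one way such a block could be read. It is not the code in BC5CDRpreprocess.py; the function name `parse_pubtator`, the dictionary keys, and the `SAMPLE` text (including its placeholder abstract) are illustrative assumptions.

```python
from typing import Dict, List

# Illustrative sample in PubTator layout (assumption: not copied from the dataset files).
SAMPLE = (
    "227508|t|Naloxone reverses the antihypertensive effect of clonidine.\n"
    "227508|a|Illustrative abstract text.\n"
    "227508\t0\t8\tNaloxone\tChemical\tD009270\n"
    "227508\t49\t58\tclonidine\tChemical\tD003000\n"
)


def parse_pubtator(text: str) -> List[Dict]:
    """Parse PubTator-formatted text into a list of documents with their mentions."""
    documents = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue
        doc = {"pmid": "", "title": "", "abstract": "", "mentions": []}
        for line in block.splitlines():
            head = line.split("|", 2)
            if len(head) == 3 and head[0].isdigit() and head[1] in ("t", "a"):
                # Title ("t") or abstract ("a") line.
                doc["pmid"] = head[0]
                doc["title" if head[1] == "t" else "abstract"] = head[2]
                continue
            fields = line.split("\t")
            if len(fields) >= 6:
                # Mention line: PMID, start, end, surface form, type, MeSH ID.
                # Relation lines have only 4 fields and are skipped here.
                doc["mentions"].append({
                    "start": int(fields[1]),
                    "end": int(fields[2]),
                    "surface": fields[3],
                    "type": fields[4],
                    "mesh_id": fields[5],
                })
        documents.append(doc)
    return documents


docs = parse_pubtator(SAMPLE)
for mention in docs[0]["mentions"]:
    print(mention["surface"], mention["mesh_id"])
```

Character offsets index into the title followed by the abstract, so a mention in the title can be checked with `doc["title"][start:end]`.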

Models and Scoring

Models

Surface-Candidate based
ANN-search based
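
To make the first option concrete, here is a minimal sketch of surface-form candidate generation: the mention string is compared with entity names, and the IDs of the closest names become the candidates. It assumes a plain string-similarity match via the standard library's `difflib`; the repository may use a different matcher. `surface_candidates` and `TOY_KB` are hypothetical names, and the dictionary is a toy stand-in for MeSH.

```python
import difflib
from typing import Dict, List

# Toy name-to-ID dictionary (assumption: a stand-in for the MeSH name list).
TOY_KB = {
    "naloxone": "D009270",
    "clonidine": "D003000",
    "hypertension": "D006973",
}


def surface_candidates(mention: str, name_to_id: Dict[str, str],
                       max_candidates_num: int = 5) -> List[str]:
    """Return up to max_candidates_num entity IDs whose names best match the mention."""
    names = difflib.get_close_matches(
        mention.lower(), list(name_to_id), n=max_candidates_num, cutoff=0.0
    )
    candidates: List[str] = []
    for name in names:  # names are ordered from most to least similar
        entity_id = name_to_id[name]
        if entity_id not in candidates:  # several names may share one ID
            candidates.append(entity_id)
    return candidates


print(surface_candidates("Clonidine", TOY_KB, max_candidates_num=2))
```

The ANN-search based option replaces the string comparison with a nearest-neighbor search over encoded mention and entity vectors.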

Scoring

Default: dot product between the mention and the predicted entity.
- Derived from [Logeswaran et al., '19]
L2 distance and cosine similarity are also supported.
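
The three scoring options can be sketched on plain vectors as follows. This is an illustration only, assuming the mention and the entity are each encoded as a fixed-length vector; the function names are hypothetical and the repository's own implementation may differ. L2 distance is negated so that, as with the other two scores, higher means a better match.

```python
import math
from typing import Sequence


def dot_score(mention: Sequence[float], entity: Sequence[float]) -> float:
    """Dot product between the two vectors."""
    return sum(a * b for a, b in zip(mention, entity))


def l2_score(mention: Sequence[float], entity: Sequence[float]) -> float:
    """Negative Euclidean distance, so that closer vectors score higher."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(mention, entity)))


def cosine_score(mention: Sequence[float], entity: Sequence[float]) -> float:
    """Dot product of the length-normalized vectors; 0.0 if either vector is zero."""
    norm_m = math.sqrt(sum(a * a for a in mention))
    norm_e = math.sqrt(sum(b * b for b in entity))
    if norm_m == 0.0 or norm_e == 0.0:
        return 0.0
    return dot_score(mention, entity) / (norm_m * norm_e)


mention_vec = [1.0, 0.0]
entity_vec = [2.0, 0.0]
print(dot_score(mention_vec, entity_vec))
print(l2_score(mention_vec, entity_vec))
print(cosine_score(mention_vec, entity_vec))
```

Unlike cosine similarity, the dot product is sensitive to vector length, so the rankings produced by the three scores can differ.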

Experiment and Evaluation

$ rm -r serialization_dir # Remove the results of a previous run, e.g. after running `python3 main.py -debug` for debugging.
$ python3 main.py

Parameters

We note only the critical parameters for training and evaluation here. For further details, see parameters.py.

Parameter Name	Description	Default
`batch_size_for_train`	Batch size during training. With a larger batch, the encoder learns to pick the correct entity from among more negative examples.	`16`
`lr`	Learning rate.	`1e-5`
`max_candidates_num`	Number of candidates generated for each mention from its surface form.	`5`
`search_method_for_faiss`	Whether to use cosine similarity (`cossim`), inner product (`indexflatip`), or L2 distance (`indexflatl2`) for approximate nearest neighbor search.	`indexflatip`
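
The link between `batch_size_for_train` and the number of negative examples can be sketched as below. The sketch assumes in-batch negatives, a common Bi-encoder training setup in which each mention is scored against every gold entity in the batch and the other mentions' entities act as negatives; whether the repository trains exactly this way is an assumption. `in_batch_negative_loss` is a hypothetical name.

```python
import math
from typing import Sequence


def in_batch_negative_loss(mentions: Sequence[Sequence[float]],
                           entities: Sequence[Sequence[float]]) -> float:
    """Mean cross-entropy where entities[i] is the gold entity for mentions[i].

    Every other entity in the batch serves as a negative example, so a batch
    of size B gives each mention B - 1 negatives.
    """
    total = 0.0
    for i, mention in enumerate(mentions):
        # Dot-product score of this mention against every entity in the batch.
        scores = [sum(a * b for a, b in zip(mention, entity)) for entity in entities]
        # Numerically stable log-sum-exp over the batch.
        max_score = max(scores)
        log_z = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        total += log_z - scores[i]
    return total / len(mentions)


# With uninformative (all-zero) mention vectors the loss equals log(batch size).
print(in_batch_negative_loss([[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```

Under this assumption, moving from a batch of 16 to a batch of 48 raises the number of negatives per mention from 15 to 47.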

Result

Surface-Candidate based recall

Generated Candidates Num	5	10	20
dev_recall	76.80	79.91	80.92
test_recall	74.35	77.14	78.25

`batch_size_for_train: 16`

Surface-Candidate based acc.

Generated Candidates Num	5	10	20
dev_acc	59.85	52.56	47.23
test_acc	58.51	51.38	45.69

ANN-search Based

(Generated Candidates Num: 50 (Fixed))

Recall@X	1 (Acc.)	5	10	50
dev_recall	21.58	42.28	50.48	67.11
test_recall	21.50	40.29	47.95	64.52

`batch_size_for_train: 48`

Surface-Candidate based acc.

Generated Candidates Num	5	10	20
dev_acc	72.39	68.21	65.40
test_acc	70.95	66.87	63.72

ANN-search Based

(Generated Candidates Num: 50 (Fixed))

Recall@X	1 (Acc.)	5	10	50
dev_recall	58.86	74.33	78.14	83.10
test_recall	57.66	73.14	76.73	81.39
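
For reference, Recall@X in the tables above can be read as the percentage of mentions whose gold entity appears among the top X ranked candidates, with Recall@1 equal to accuracy. A minimal sketch of that computation follows; `recall_at_k` and the toy data are illustrative and not taken from the repository's evaluation code.

```python
from typing import List, Sequence


def recall_at_k(ranked_candidates: Sequence[Sequence[str]],
                gold_ids: Sequence[str], k: int) -> float:
    """Percentage of mentions whose gold ID is within the top-k candidates."""
    hits = sum(
        1 for candidates, gold in zip(ranked_candidates, gold_ids)
        if gold in candidates[:k]
    )
    return 100.0 * hits / len(gold_ids)


# Toy data (assumption): four mentions, each with three ranked candidate IDs.
ranked: List[List[str]] = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["X", "Y", "Z"],
]
gold = ["A", "A", "A", "A"]

for k in (1, 2, 3):
    print(k, recall_at_k(ranked, gold, k))
```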

LICENSE

MIT
