Dual-encoder-Entity-Retrieval-with-BERT

A re-implementation of the bi-encoder (also called dual-encoder) for entity linking. You can run an experiment in as little as 3 seconds.

MIT License

Ecosystems: Python

Dual-encoder-with-BERT

Quick-Start Experiment in 3 Seconds

git clone https://github.com/izuna385/Dual-encoder-with-BERT.git
cd Dual-encoder-with-BERT
python3 train.py -num_epochs 1

To speed things up further, you can use multiple GPUs.

CUDA_VISIBLE_DEVICES=0,1 python3 train.py -num_epochs 1 -cuda_devices 0,1

Description

A re-implementation of the bi-encoders of [Gillick et al., '19] and [Humeau et al., '20].

You can run Bi-encoder based Entity Linking experiments with your own datasets.
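
As a rough illustration of the bi-encoder idea, the sketch below scores a mention against every entity by the dot product of independently computed vectors and picks the highest-scoring entity. The vectors are hand-written toy values standing in for BERT outputs, and the function names (`dot`, `link_mention`) are hypothetical and not taken from this repository.

```python
from typing import Dict, List


def dot(u: List[float], v: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def link_mention(mention_vec: List[float],
                 entity_vecs: Dict[str, List[float]]) -> str:
    """Return the cui whose entity vector scores highest against the mention."""
    return max(entity_vecs, key=lambda cui: dot(mention_vec, entity_vecs[cui]))


# Toy vectors standing in for encoder outputs (assumed values, not real embeddings).
entity_vecs = {
    "D000001": [0.9, 0.1, 0.0],
    "D000002": [0.0, 0.8, 0.6],
}
mention_vec = [1.0, 0.2, 0.1]

predicted = link_mention(mention_vec, entity_vecs)
print(predicted)
```

Because mentions and entities are encoded separately, the entity vectors can be computed once and reused for every mention.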

Notes

These experiments are specifically for in-domain entity linking. For the zero-shot setting, see this repository.

Requirements

Packages

See requirements.txt. If allennlp is not installed in your local environment, follow the AllenNLP documentation.

Files

Entities

You need cui2idx.json, idx2cui.json, cui2cano.json, and cui2def.json to encode the entities of the specified KB (or entity set).

cui2idx.json and idx2cui.json

cui is a unique id for each entity, such as D0002131 for the United States of America in DBpedia.

idx is an integer assigned to each cui.
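
A minimal sketch of how the two mappings could be built is shown below. The cuis are hypothetical, and assigning consecutive integers in sorted order is an assumption made for reproducibility, not a documented requirement of this repository.

```python
import json

# Hypothetical entity ids; replace with the cuis of your own KB.
cuis = ["D000001", "D000002", "D000003"]

# Assign consecutive integers to the cuis (sorted for a reproducible mapping).
cui2idx = {cui: idx for idx, cui in enumerate(sorted(cuis))}
# JSON object keys are always strings, so each integer idx is stored as a string key.
idx2cui = {str(idx): cui for cui, idx in cui2idx.items()}

cui2idx_json = json.dumps(cui2idx, indent=2)
idx2cui_json = json.dumps(idx2cui, indent=2)
print(cui2idx_json)
print(idx2cui_json)
```

The two strings can then be written to cui2idx.json and idx2cui.json.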

cui2cano.json and cui2def.json

A canonical name is the entity name of each entity. Canonical names and definitions (the first sentence of the definition is often used) must be split into tokens.
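
The following sketch shows one way to prepare tokenized canonical names and definitions. The regex tokenizer, the naive first-sentence splitter, and storing each value as a list of tokens are all assumptions for illustration; check the sample files in the repository for the exact format expected.

```python
import json
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Split text into word and punctuation tokens (a simple stand-in tokenizer)."""
    return re.findall(r"\w+|[^\w\s]", text)


def first_sentence(text: str) -> str:
    """Return the text up to the first period that is followed by whitespace or the end."""
    match = re.search(r"^(.*?\.)(\s|$)", text, flags=re.DOTALL)
    return match.group(1) if match else text


# Hypothetical entity with a two-sentence definition.
definition = "Harry Potter is a series of fantasy novels. It was written by J. K. Rowling."

cui2cano = {"D000001": tokenize("Harry Potter")}
cui2def = {"D000001": tokenize(first_sentence(definition))}

print(json.dumps(cui2cano))
print(json.dumps(cui2def))
```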

Annotated mentions

You also need annotated train/dev/test mentions. See ./mention_dump_dir/xxx/ for more details.

id2line.json: contains all annotated mentions, including train, dev, and test.
```
"0": "D000001\tPER\tHarry Potter\tThe success of the books and films has allowed the <target> Harry Potter </target> franchise ..."
```
- "0": Unique mention id.
- "D000001": Gold entity (cui) for the mention.
- "PER": Type, like ORG, LOC, and MISC. You can use a dummy tag because the type is not used for training.
- "Harry Potter": Raw mention string.
- "The success ...": One sentence containing the mention. The mention is wrapped in the special tokens <target> and </target>.
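
The fields listed above can be read back with a few lines of Python. This is a sketch under the assumption that each value has exactly four tab-separated fields; the `Mention` type and the helper names are hypothetical and not part of this repository.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    gold_cui: str
    type_tag: str
    raw_mention: str
    sentence: str


def parse_line(line: str) -> Mention:
    """Split one id2line.json value into its four tab-separated fields."""
    gold_cui, type_tag, raw_mention, sentence = line.split("\t", 3)
    return Mention(gold_cui, type_tag, raw_mention, sentence)


def marked_span(sentence: str) -> str:
    """Return the text between <target> and </target>, without surrounding spaces."""
    match = re.search(r"<target>(.*?)</target>", sentence)
    if match is None:
        raise ValueError("sentence has no <target> ... </target> span")
    return match.group(1).strip()


# After json.load, the "\t" escapes in the file are real tab characters.
line = ("D000001\tPER\tHarry Potter\tThe success of the books and films "
        "has allowed the <target> Harry Potter </target> franchise ...")
mention = parse_line(line)
print(mention.gold_cui, mention.raw_mention, marked_span(mention.sentence))
```

Comparing `raw_mention` with the marked span is a cheap sanity check when building your own id2line.json.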

How to run experiments immediately

To check the scripts with dummy datasets, run python3 train.py -num_epochs 1
- Linking evaluation reports accuracy over the entire set of mentions, not normalized accuracy.
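
The difference between the two metrics can be sketched as follows, assuming the common convention that normalized accuracy is computed only over mentions whose gold entity appears among the retrieved candidates, while accuracy over the entire set counts every mention. This convention and the function names are assumptions for illustration, not taken from this repository's evaluation code.

```python
from typing import List, Tuple

# Each item: (gold cui, ranked candidate cuis); hypothetical toy predictions.
Example = Tuple[str, List[str]]


def entire_accuracy(examples: List[Example]) -> float:
    """Fraction of all mentions whose top-ranked candidate is the gold entity."""
    correct = sum(1 for gold, cands in examples if cands and cands[0] == gold)
    return correct / len(examples)


def normalized_accuracy(examples: List[Example]) -> float:
    """Same measure, restricted to mentions whose gold entity was retrieved."""
    covered = [(gold, cands) for gold, cands in examples if gold in cands]
    if not covered:
        return 0.0
    return entire_accuracy(covered)


examples = [
    ("D000001", ["D000001", "D000002"]),  # correct
    ("D000002", ["D000003", "D000002"]),  # gold retrieved but ranked second
    ("D000003", ["D000001", "D000002"]),  # gold not retrieved
    ("D000004", ["D000004", "D000001"]),  # correct
]
print(entire_accuracy(examples), normalized_accuracy(examples))
```

Under this convention, the accuracy over the entire set can never exceed the normalized one, since retrieval misses count as errors.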

How to train the Bi-encoder with specific datasets?

Prepare the entity files described above and a linking dataset.
- The required dataset formats can be checked in the ./dataset/ directory.

To-do list

Make dataset creation easier.
Pip packaging.

LICENSE

MIT
