git clone https://github.com/izuna385/Dual-encoder-with-BERT.git
cd Dual-encoder-with-BERT
python3 train.py -num_epochs 1
For further speednizing, you can use multi gpus.
CUDA_VISIBLE_DEVICES=0,1 python3 train.py -num_epochs 1 -cuda_devices 0,1
Re-implementation of [Gillick et al., '19] and [Humeau et al., '20] 's bi-encoder.
See requirements.txt
.
If allennlp
is not installed to your local environments, follow Allennlp documentation.
You need cui2idx.json
, idx2cui.json
, cui2cano.json
, and cui2def.json
for encoding entities of specified KB (, or, entity set).
cui2idx.json
and idx2json
cui means one unique id for each entity, like D0002131
of United stated of America
in DBpedia.
idx is integer for each cui.
cui2cano.json
and cui2def.json
Canonical names specify entity name for each entity. Canonical names and Definitions (first sentence of definition is often used here) must be split to tokens.
You also needs annotated train/dev/test mentions.
See ./mention_dump_dir/xxx/
for more details.
id2line.json This contains all annotated mentions including train, dev and test.
"0": "D000001\tPER\tHarry\tThe success of the books and films has allowed the <target> Harry Potter </target> franchise ..."
<target>,</target>.
For checking scripts with dummy datasets, run python3 train.py -num_epochs 1
Prepare entities mentioned above, and linking dataset.
./dataset/
directory.Make dataset creation more easier.
Pip packaging.
MIT