MIT License
This repository contains training scripts and instructions for reproducing our systems submitted to the NEWS 2018 Shared Task on Transliteration of Named Entities, described in: R. Grundkiewicz and K. Heafield, "Neural Machine Translation Techniques for Named Entity Transliteration", NEWS 2018, ACL 2018.
Citation:
@InProceedings{grundkiewicz-heafield:2018:NEWS2018,
author = {Grundkiewicz, Roman and Heafield, Kenneth},
title = {Neural Machine Translation Techniques for Named Entity Transliteration},
booktitle = {Proceedings of the Seventh Named Entities Workshop},
month = {July},
year = {2018},
address = {Melbourne, Australia},
publisher = {Association for Computational Linguistics},
pages = {89--94},
url = {http://www.aclweb.org/anthology/W18-2413}
}
Download and compile Marian in tools/marian-dev:
cd tools
git clone https://github.com/marian-nmt/marian-dev
mkdir marian-dev/build
cd marian-dev/build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j8
cd ../../..
If needed, please refer to the official Marian documentation at https://marian-nmt.github.io/docs.
Download data sets 01-04 from http://workshop.colips.org/news2018/dataset.html and unzip them into datasets.
Prepare training and development data:
cd experiments
./prepare-data.sh
Train baseline systems, specifying the GPU device(s) and one or more language directions, e.g.:
./train.sh '0 1' EnVi EnCh ChEn
Each system will be an ensemble of 4 deep RNN models rescored with 2 right-to-left models.
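The ensembling and right-to-left rescoring above can be sketched as follows. This is an illustrative toy, not Marian's internals: the models, the equal interpolation weight, and the score tables are all assumptions made for the example.

```python
# Sketch of ensembling + right-to-left (r2l) rescoring.
# Ensembling averages per-model log-probabilities of a candidate;
# r2l rescoring adds scores from models that read the target backwards.

def ensemble_score(candidate, models):
    """Average log-probability of a candidate under several models."""
    return sum(m(candidate) for m in models) / len(models)

def rescore(nbest, l2r_models, r2l_models, weight=0.5):
    """Re-rank an n-best list by interpolating l2r and r2l ensemble scores."""
    def total(c):
        l2r = ensemble_score(c, l2r_models)
        r2l = ensemble_score(c[::-1], r2l_models)  # r2l models see the reversed target
        return (1 - weight) * l2r + weight * r2l
    return sorted(nbest, key=total, reverse=True)

# Toy models: score a string via a fixed table of log-probs (assumed values).
l2r = [lambda c: {"abc": -1.0, "abd": -1.5}.get(c, -9.0)] * 4
r2l = [lambda c: {"cba": -0.5, "dba": -2.0}.get(c, -9.0)] * 2
print(rescore(["abd", "abc"], l2r, r2l))  # → ['abc', 'abd']
```

The r2l scores break ties and penalize candidates that only look good when read left to right.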
The evaluation scores can be collected by running:
./show-results.sh
A text file can be translated using the translate.sh script, for example:
head data/EnVi.dev.src | ./translate.sh EnVi file.tmp 0 > file.out
Prepare synthetic data with the back-translation or forward-translation method:
./prepare-synthetic-data.sh
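The idea behind back-translation can be sketched as below. The helper names and the toy reverse model are hypothetical; in the actual script a trained reverse-direction system produces the synthetic source side.

```python
# Sketch of back-translation: monolingual target-side names are translated
# back into the source script by a reverse-direction model, and the resulting
# synthetic (source, target) pairs augment the genuine training data.

def back_translate(monolingual_targets, reverse_model):
    """Pair each target-side string with a model-generated source string."""
    return [(reverse_model(t), t) for t in monolingual_targets]

# Toy reverse model (assumption): uppercasing stands in for real translation.
toy_reverse = str.upper
synthetic = back_translate(["hanoi", "danang"], toy_reverse)
print(synthetic)  # → [('HANOI', 'hanoi'), ('DANANG', 'danang')]
```

Forward-translation works the same way in the opposite direction: genuine source-side data is translated by the forward model to produce synthetic targets.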
The systems can be re-trained with the additional data by replacing the original folders and re-running the training script, e.g.:
mv data data.original
mv synthetic data
mv models models.baseline
./train.sh '0 1' EnVi EnCh ChEn
./show-results.sh
For the EnVi system, this should display results similar to the following:
ACC Fscore MRR MAPref
models.baseline/EnVi.1 0.4680 0.8742 0.5582 0.4680
models.baseline/EnVi.2 0.4900 0.8806 0.5693 0.4900
models.baseline/EnVi.3 0.4580 0.8744 0.5521 0.4580
models.baseline/EnVi.4 0.4600 0.8692 0.5543 0.4600
models.baseline/EnVi.ens 0.4740 0.8783 0.5649 0.4740
models.baseline/EnVi.ens.r2l 0.4800 0.8815 0.5767 0.4800
models.baseline/EnVi.ens.r2l.rescore 0.4880 0.8830 0.5777 0.4880
models.baseline/EnVi.r2l.1 0.4520 0.8710 0.5548 0.4520
models.baseline/EnVi.r2l.2 0.4860 0.8759 0.5791 0.4860
models/EnVi.1 0.4980 0.8856 0.5838 0.4980
models/EnVi.2 0.4860 0.8833 0.5771 0.4860
models/EnVi.3 0.4860 0.8836 0.5785 0.4860
models/EnVi.4 0.4980 0.8854 0.5833 0.4980
models/EnVi.ens 0.5000 0.8865 0.5859 0.5000
models/EnVi.ens.r2l 0.4820 0.8858 0.5817 0.4820
models/EnVi.ens.r2l.rescore 0.5020 0.8884 0.5905 0.5020
models/EnVi.r2l.1 0.4800 0.8843 0.5789 0.4800
models/EnVi.r2l.2 0.4920 0.8876 0.5860 0.4920