Code for the paper Data-to-Text Generation with Iterative Text Editing
License: Apache-2.0
The code for generating text from RDF triples by iteratively applying sentence fusion on the templates.
A description of the method can be found in:
Zdeněk Kasner & Ondřej Dušek (2020): Data-to-Text Generation with Iterative Text Editing. In: Proceedings of the 13th International Conference on Natural Language Generation (INLG 2020).
```bash
pip install -r requirements.txt
./download_datasets_and_models.sh
./run.sh webnlg
```
This branch contains the code for replicating the experiments described in the paper. The code works with the original implementations of LaserTagger and BERT in TensorFlow 1.15. However, this implementation is now somewhat obsolete and less flexible.
If you wish to extend this implementation, you may be interested in the version implemented in PyTorch Lightning, which is in the torch branch.
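As a rough illustration of the overall idea: each input triple is first lexicalized with a simple per-predicate template, and the resulting sentences are then merged by iteratively applying sentence fusion (see the decoding step below). Here is a minimal sketch of the template step; the template strings and the `fill_template` helper are illustrative assumptions, not the repository's actual templates:

```python
# Illustration only: lexicalizing single RDF triples with hand-written
# templates. The templates and the helper below are hypothetical.
TEMPLATES = {
    "birthPlace": "<subj> was born in <obj>.",
    "occupation": "<subj> worked as a <obj>.",
}

def fill_template(triple):
    subj, pred, obj = triple
    return (TEMPLATES[pred]
            .replace("<subj>", subj.replace("_", " "))
            .replace("<obj>", obj.replace("_", " ")))

triples = [("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
           ("Alan_Bean", "occupation", "test_pilot")]
print([fill_template(t) for t in triples])
# ['Alan Bean was born in Wheeler, Texas.', 'Alan Bean worked as a test pilot.']
```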
See `requirements.txt` for the full list of dependencies. All packages can be installed using:

```bash
pip install -r requirements.txt
```

Select `tensorflow-1.15` instead of `tensorflow-1.15-gpu` if you do not wish to use the GPU.
All datasets and models can be downloaded using the command:

```bash
./download_datasets_and_models.sh
```

The script downloads all dependencies (datasets, models, and external repositories), including the LMScorer built on the `transformers` package. The script does not re-download dependencies which are already located in their respective paths.

The pipeline involves four steps:

1. preprocessing
2. training
3. decoding
4. evaluation
All steps can be run separately by following the instructions below, or all at once using the script:

```bash
./run.sh <experiment>
```

where `<experiment>` can be one of:

- `webnlg` - train and evaluate on the WebNLG dataset
- `e2e` - train and evaluate on the E2E dataset
- `df-webnlg` - train on DiscoFuse and evaluate on WebNLG (zero-shot domain adaptation)
- `df-e2e` - train on DiscoFuse and evaluate on E2E (zero-shot domain adaptation)

Preprocessing involves parsing the original data-to-text datasets and extracting examples for training the sentence fusion model.
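For intuition, an extracted sentence-fusion example pairs two separate sentences with their fused counterpart. A hypothetical sketch of its shape (the actual format produced by `preprocess.py` may differ):

```python
# Hypothetical shape of one extracted sentence-fusion training example;
# the actual on-disk format produced by preprocess.py may differ.
fusion_example = {
    "source": "Alan Bean was born in Wheeler, Texas. Alan Bean worked as a test pilot.",
    "target": "Alan Bean, who worked as a test pilot, was born in Wheeler, Texas.",
}
```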
Example of using the preprocessing script:

```bash
# preprocess the WebNLG dataset in the "full" mode
python3 preprocess.py \
    --dataset "WebNLG" \
    --input "datasets/webnlg/data/v1.4/en/" \
    --mode "full" \
    --splits "train" "test" "dev" \
    --lms_device "cpu"
```
Things you may want to consider:

- The extraction mode (`--mode`) can be set to `full`, `best_tgt`, or `best`. The modes are described in the supplementary material of the paper.
- The mode `full` runs on CPU; `best_tgt` and `best` use the LMScorer and can use the GPU (`--lms_device gpu`).
- Templates can be re-generated with the flag `--force_generate_templates`. However, note that the double templates for E2E have been manually denoised (the generated version will not be identical to the one used in the experiments).
- A custom dataset can be added in `datasets.py` by deriving a class from `Dataset` and overriding the relevant methods. The dataset is then selected with the parameter `--dataset`, using the class name as an argument (see the sketch below).
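A minimal sketch of such an extension, assuming the `Dataset` base class exposes a loading method to override (the method name `load_from_dir` is illustrative, not the actual interface):

```python
# In datasets.py -- illustrative sketch of a custom dataset class;
# the actual base-class interface may differ.
class MyDataset(Dataset):
    def load_from_dir(self, path, split):
        # parse the raw files for the given split and store the
        # (triples, reference text) pairs on the instance
        ...
```

The class is then selected with `--dataset "MyDataset"`.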
Training generally follows the pipeline for finetuning the LaserTagger model. However, instead of using individual scripts for each step, the training pipeline is encapsulated in `train.py`.
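For background, LaserTagger casts sentence fusion as sequence tagging: each source token is assigned an edit tag such as KEEP or DELETE, optionally combined with a phrase to insert in front of the token (the `--vocab_size` parameter in the example below sets the size of this phrase vocabulary). A simplified sketch of how such tags realize a fused sentence; this is not the repository's implementation:

```python
# Simplified LaserTagger-style tag application (not the repository's
# implementation). A tag is "KEEP" or "DELETE", optionally followed by
# "|phrase", meaning: insert the phrase before this token.
def apply_tags(tokens, tags):
    out = []
    for token, tag in zip(tokens, tags):
        op, _, phrase = tag.partition("|")
        if phrase:
            out.append(phrase)
        if op == "KEEP":
            out.append(token)
    return " ".join(out)

src = "He was born in Wheeler . He is a pilot .".split()
tags = ["KEEP", "KEEP", "KEEP", "KEEP", "KEEP",
        "DELETE|, and", "DELETE", "KEEP", "KEEP", "KEEP", "KEEP"]
print(apply_tags(src, tags))
# -> "He was born in Wheeler , and is a pilot ."
```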
Example of using the training script:

```bash
python3 train.py \
    --dataset "WebNLG" \
    --mode "full" \
    --experiment "webnlg_full" \
    --vocab_size 100 \
    --num_train_steps 10000
```
Things you may want to consider:

- The model is wrapped in `model_tf.py`. The wrapper calls the methods from the LaserTagger repository (directory `lasertagger_tf`) similarly to the original implementation.
- The flags `--train_only` and `--export_only` can be used to skip the other pipeline phases.
- If the GPU version is installed (`tensorflow-1.15-gpu`) but the GPU is not used, check if the CUDA libraries were linked correctly.

Once the model is trained, the decoding algorithm is used to generate text from RDF triples. See the top figure and/or the paper for details on the method.
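Conceptually, decoding starts from the template for one triple and repeatedly fuses the current text with the template for the next triple, using the LMScorer to pick the most fluent candidate. A minimal sketch, where `fuse` and `score` stand in for the trained sentence fusion model and the LMScorer; consult the paper for the actual algorithm:

```python
# Conceptual sketch of the iterative decoding loop. fuse() and score()
# stand in for the trained sentence fusion model and the LMScorer.
def decode(templates, fuse, score):
    text = templates[0]
    for template in templates[1:]:
        candidates = fuse(text, template)   # sentence fusion proposals
        text = max(candidates, key=score)   # keep the most fluent one
    return text
```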
Example of using the decoding script:

```bash
python3 decode.py \
    --dataset "WebNLG" \
    --experiment "webnlg_full" \
    --dataset_dir "datasets/webnlg/data/v1.4/en/" \
    --split "test" \
    --lms_device "cpu" \
    --vocab_size 100
```
Things you may want to consider:

- The LMScorer can run on the GPU (`--lms_device gpu`). Note, however, that this may require a secondary GPU if the first GPU is already used for LaserTagger.
- The output is stored in `out/<experiment>_<vocab_size>_<split>.hyp`.
- Use `--use_e2e_double_templates` to bootstrap the decoding process from the templates for pairs of triples in the case of the E2E dataset; the templates for single triples (handcrafted for E2E) are used otherwise.
- Use `--no_export` to suppress saving the output to the `out` directory.

The decoded output is evaluated against multiple references using the e2e-metrics package.
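The repository's `evaluate.py` wraps the e2e-metrics package; purely as a self-contained illustration of what multi-reference scoring means, here is a BLEU computation with the sacrebleu library (not a dependency of this repository):

```python
# Multi-reference BLEU with sacrebleu -- an illustration only; the
# repository itself evaluates with the e2e-metrics package.
import sacrebleu

hyps = ["Alan Bean was born in Wheeler and worked as a test pilot."]
refs = [["Alan Bean, born in Wheeler, was a test pilot."],      # 1st reference per instance
        ["Born in Wheeler, Alan Bean worked as a test pilot."]] # 2nd reference per instance
print(sacrebleu.corpus_bleu(hyps, refs).score)
```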
Example of using the evaluation script:

```bash
python3 evaluate.py \
    --ref_file "data/webnlg/ref/test.ref" \
    --hyp_file "out/webnlg_full_100_test.hyp" \
    --lowercase
```
```bibtex
@inproceedings{kasner-dusek-2020-data,
    title = "Data-to-Text Generation with Iterative Text Editing",
    author = "Kasner, Zden{\v{e}}k and
      Du{\v{s}}ek, Ond{\v{r}}ej",
    booktitle = "Proceedings of the 13th International Conference on Natural Language Generation",
    month = dec,
    year = "2020",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.inlg-1.9",
    pages = "60--67"
}
```