Textsum: Automatic Text Summarization Based on a Seq2Seq+Attention Model
==================
This textsum example implements a Seq2Seq-attention model in TensorFlow (1.0.0) for the automatic summarization task (it is also compatible with tf 0.12.0; see the deepnlp 0.1.5 release). The code is adapted from the original English-French translate.py model in the TensorFlow tutorial. ROUGE and BLEU evaluation methods are provided in the eval.py module. A pre-trained headline-generation model for Chinese news articles is also distributed.
We use the Chinese news corpus from Sogou Labs (SogouCS, Sohu news from June and July 2012). You can download it from http://www.sogou.com/labs/resource/cs.php
Seq2Seq-attention config params:
Encoder: LSTM
Bucket: News articles are cut to 120 words; shorter articles are padded with 'PAD'. Titles are cut to 30 words. buckets = [(120, 30), ...]
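As an illustration, here is a minimal sketch of the padding step implied by this bucket setting (the PAD_TOKEN constant and the pad_to_bucket helper are our own names for illustration, not from the package):
PAD_TOKEN = 'PAD'
buckets = [(120, 30)]

def pad_to_bucket(content_words, title_words, bucket=(120, 30)):
    # Truncate the article to 120 words, then right-pad with PAD;
    # the title is simply cut to 30 words.
    source_size, target_size = bucket
    content = content_words[:source_size]
    content += [PAD_TOKEN] * (source_size - len(content))
    return content, title_words[:target_size]

content, title = pad_to_bucket('TAG_NAME_EN TAG_DATE'.split(), 'TAG_DATE'.split())
print(len(content), len(title))  # 120 1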
Model File: The pre-trained model, trained on roughly 1M news articles on a Linux machine with 8 CPUs, is named 'headline_large.ckpt-48000.data-00000-of-00001'; it is compressed and split into 3 files: *.tar.gz00, *.tar.gz01, *.tar.gz02. To restore it:
#Model File Name: headline_large.ckpt-48000.data-00000-of-00001
cd ./ckpt
cat headline_large.ckpt-48000.* > headline_large.ckpt-48000.data-00000-of-00001.tar.gz
tar xzvf headline_large.ckpt-48000.data-00000-of-00001.tar.gz
Run predict.py and interactively input the Chinese news text (space-separated words); the automatically generated headline will be returned.
python predict.py
Output examples (dates, numbers, and English names in the corpus are replaced by the placeholder tokens TAG_DATE, TAG_NUMBER, and TAG_NAME_EN):
news: TAG_DATE TAG_NUMBER TAG_NAME_EN TAG_DATE TAG_NUMBER TAG_DATE TAG_NUMBER TAG_NAME_EN
headline:
news: TAG_NAME_EN TAG_DATE TAG_NAME_EN TAG_NUMBER TAG_NAME_EN TAG_NUMBER TAG_NAME_EN TAG_NUMBER
headline: TAG_NUMBER
...
Run the command line: python predict.py arg1 arg2 arg3. arg1: path to the space-separated Chinese news content corpus, one article per line; arg2: path to the human-written reference headlines for the news, one title per line; arg3: path where the machine-generated summaries will be saved.
The script will call the evaluate(X, Y, method = "rouge_n", n = 2) method in eval.py; a sketch of that metric follows the shell example below.
folder_path=`pwd`
input_dir=${folder_path}/news/test/content-test.txt
reference_dir=${folder_path}/news/test/title-test.txt
summary_dir=${folder_path}/news/test/summary.txt
python predict.py $input_dir $reference_dir $summary_dir
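For reference, here is a minimal sketch of a ROUGE-n recall score of the kind evaluate() computes (our own illustration, not the actual eval.py source):
from collections import Counter

def rouge_n(candidate_tokens, reference_tokens, n=2):
    # ROUGE-n recall: overlapping n-grams / total n-grams in the reference.
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate_tokens), ngrams(reference_tokens)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

print(rouge_n('a b c d'.split(), 'a b c e'.split(), n=2))  # 2 of 3 reference bigrams match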
To get the attention mask matrix, we need to modify the standard seq2seq ops in tf.nn.seq2seq. Right now there is no available method to extract those tensors, so we modify the source file and save the modified copy as seq2seq_attn.py in this package. Please check out this blog for details: http://www.deepnlp.org/blog/textsum-seq2seq-attention/
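To make the attention mask concrete: at each decoder step the attention weights are a softmax over scores between the decoder state and the encoder states, and stacking one weight vector per decoder step yields the matrix that predict_attn.py visualizes. A minimal numpy sketch (illustrative only; the real ops live inside the modified seq2seq_attn.py):
import numpy as np

def attention_weights(decoder_state, encoder_states):
    # Softmax over dot-product scores: one row of the attention mask matrix.
    scores = encoder_states @ decoder_state          # shape (source_len,)
    exp = np.exp(scores - scores.max())              # numerically stable softmax
    return exp / exp.sum()

enc = np.random.randn(120, 256)   # encoder states for a (120, 30) bucket
dec = np.random.randn(30, 256)    # decoder states
mask = np.stack([attention_weights(d, enc) for d in dec])
print(mask.shape)  # (30, 120), ready for a heatmap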
# Call the method in eval.py: plot_attention(data, X_label=None, Y_label=None), based on the matplotlib package
# The attention heatmap will be saved under the img folder
python predict_attn.py
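Only the signature of plot_attention is given above; here is a guess at how such a function could render the heatmap with matplotlib (not the actual eval.py body; the default save path is our assumption):
import os
import numpy as np
import matplotlib
matplotlib.use('Agg')   # render off-screen so the figure can be written to disk
import matplotlib.pyplot as plt

def plot_attention(data, X_label=None, Y_label=None, path='img/attention.png'):
    # data: (target_len, source_len) attention matrix.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    fig, ax = plt.subplots()
    ax.pcolor(data, cmap=plt.cm.Blues)
    if X_label is not None:
        ax.set_xticks(np.arange(len(X_label)) + 0.5)
        ax.set_xticklabels(X_label, rotation=90)
    if Y_label is not None:
        ax.set_yticks(np.arange(len(Y_label)) + 0.5)
        ax.set_yticklabels(Y_label)
    fig.savefig(path)

plot_attention(np.random.rand(5, 8), X_label=list('abcdefgh'), Y_label=list('vwxyz'))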
Examples:
Prepare four documents: content-train.txt, title-train.txt, content-dev.txt, title-dev.txt. The content files hold one article per line and the title files hold the corresponding headline per line. The corpus format is as below (a preprocessing sketch follows the examples):
content-train.txt
TAG_NAME_EN TAG_NAME_EN TAG_DATE TAG_NAME_EN TAG_NAME_EN TAG_NAME_EN
TAG_DATE TAG_NAME_EN TAG_NAME_EN
...
title-train.txt
...
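As a minimal sketch of how such placeholder tokens can be produced from raw tokens (the regexes and the replace_tags helper are our illustration; the package's real preprocessing may differ):
import re

def replace_tags(tokens):
    # Map dates, numbers, and English words to the placeholder tokens above.
    out = []
    for tok in tokens:
        if re.fullmatch(r'\d{4}-\d{2}-\d{2}', tok):
            out.append('TAG_DATE')
        elif re.fullmatch(r'\d+(\.\d+)?', tok):
            out.append('TAG_NUMBER')
        elif re.fullmatch(r'[A-Za-z]+', tok):
            out.append('TAG_NAME_EN')
        else:
            out.append(tok)
    return out

print(' '.join(replace_tags('Sohu 2012-06-30 671 news'.split())))
# TAG_NAME_EN TAG_DATE TAG_NUMBER TAG_NAME_EN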
Train the model; checkpoints will be saved under the ./ckpt folder:
python headline.py