textsum is a Seq2Seq-attention model built on tensorflow (1.0.0) (tf 0.12.0, deepnlp 0.1.5) for automatic headline generation from Chinese news. The Seq2Seq code is adapted from translate.py, and eval.py provides ROUGE and BLEU evaluation.

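The corpus is the Sohu news corpus (SogouCS), covering June and July 2012 with 1M news articles. It is available from Sogou Labs: http://www.sogou.com/labs/resource/cs.php The model was trained on an 8-CPU Linux machine.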

Tutorial

This textsum example implements a Seq2Seq model in tensorflow for the automatic summarization task. The code is adapted from the English-French translate.py model in the tensorflow tutorial. The eval.py module provides ROUGE and BLEU evaluation, and a pre-trained headline generation model for Chinese news articles is also distributed.

Corpus and Config

We use the Chinese news corpus from sohu.com. You can download it from http://www.sogou.com/labs/resource/cs.php

Seq2Seq-attention config params:

Encoder LSTM:

num_layers = 4 # 4-layer LSTM
size = 256 # 256 units per layer
num_samples = 4096 # number of samples used by the sampled softmax
batch_size = 64 # 64 examples per batch
vocab_size = 50000 # the 50000 most frequent words in the dictionary

Bucket: News articles are cut to 120 words, and articles with fewer words are padded with 'PAD'. Titles are cut to 30 words. buckets = [(120, 30), ...]
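
As a rough illustration of the bucket rule above, the sketch below truncates or pads one tokenized pair to the (120, 30) bucket. The helper names fit_to_length and fit_to_bucket are hypothetical and not part of this package. The real pipeline works on vocabulary ids, and the translate.py tutorial it derives from also adds GO/EOS symbols and reverses the encoder input; those steps are left out here.

```python
PAD = "PAD"          # padding token named in the text
ARTICLE_LEN = 120    # encoder side of the (120, 30) bucket
TITLE_LEN = 30       # decoder side of the (120, 30) bucket


def fit_to_length(tokens, length, pad=PAD):
    """Truncate tokens to `length`, then right-pad with `pad` up to `length`."""
    clipped = tokens[:length]
    return clipped + [pad] * (length - len(clipped))


def fit_to_bucket(article, title):
    """Fit one space-separated (article, title) pair into the (120, 30) bucket.

    Hypothetical helper for illustration only, not the package's own code.
    """
    return (fit_to_length(article.split(), ARTICLE_LEN),
            fit_to_length(title.split(), TITLE_LEN))


if __name__ == "__main__":
    enc, dec = fit_to_bucket("w1 w2 w3", "t1 t2")
    print(len(enc), len(dec))
```

Every pair comes out with exactly 120 article tokens and 30 title tokens, so batches have a fixed shape.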

Model File: The pre-trained model file is named 'headline_large.ckpt-48000.data-00000-of-00001'. It is compressed and split into 3 files: *.tar.gz00, *.tar.gz01, *.tar.gz02. Concatenate the parts and extract the archive before running prediction:

#Model File Name: headline_large.ckpt-48000.data-00000-of-00001

cd ./ckpt
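# Join only the split parts (*.tar.gz00 to *.tar.gz02); a bare ".*" glob would also pick up other checkpoint files
cat headline_large.ckpt-48000.*tar.gz0[0-2] > headline_large.ckpt-48000.data-00000-of-00001.tar.gz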
tar xzvf headline_large.ckpt-48000.data-00000-of-00001.tar.gz
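
On systems without cat and tar, the same two steps can be sketched in Python. This is only an illustration: the function name reassemble_and_extract is hypothetical, and it assumes the parts are named as the archive name followed by a two-digit suffix (00, 01, 02), as the *.tar.gz00 pattern above suggests.

```python
import glob
import os
import shutil
import tarfile


def reassemble_and_extract(ckpt_dir, archive_name):
    """Join archive_name00, archive_name01, ... into archive_name, then untar it.

    Assumes two-digit part suffixes; returns the list of parts that were joined.
    """
    pattern = os.path.join(ckpt_dir, archive_name + "[0-9][0-9]")
    parts = sorted(glob.glob(pattern))  # sorted so 00, 01, 02 are joined in order
    if not parts:
        raise FileNotFoundError("no split parts match " + pattern)

    archive_path = os.path.join(ckpt_dir, archive_name)
    with open(archive_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)

    # Only extract archives you trust; this unpacks into ckpt_dir.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(ckpt_dir)
    return parts


if __name__ == "__main__":
    reassemble_and_extract(
        "./ckpt", "headline_large.ckpt-48000.data-00000-of-00001.tar.gz")
```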

Prediction

Run the script

Run predict.py and enter space-separated Chinese news text interactively; the automatically generated headline is returned.

python predict.py

Output examples

news:        TAG_DATE TAG_NUMBER       TAG_NAME_EN  TAG_DATE TAG_NUMBER  TAG_DATE TAG_NUMBER  TAG_NAME_EN                                  
headline:            

news:            TAG_NAME_EN   TAG_DATE            TAG_NAME_EN   TAG_NUMBER TAG_NAME_EN TAG_NUMBER            TAG_NAME_EN  TAG_NUMBER      
headline:      TAG_NUMBER  
...

Evaluate ROUGE-score

Run the command line: python predict.py arg1 arg2 arg3
arg1: path to the space-separated Chinese news content corpus, one article per line
arg2: path to the human-written headlines for the news, one title per line
arg3: path where the machine-generated summaries will be saved

The script then calls the evaluate(X, Y, method = "rouge_n", n = 2) method in eval.py, i.e. it scores the summaries with ROUGE-2.
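
For readers unfamiliar with the metric, here is a minimal sketch of recall-oriented ROUGE-N for a single pair, following the standard definition (clipped n-gram matches divided by the number of n-grams in the reference). It is not the code in eval.py, whose internals are not shown here; the names ngrams and rouge_n_recall are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """ROUGE-N recall for one space-separated candidate/reference pair."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference too short to contain any n-gram
    # Each reference n-gram is matched at most as often as it occurs in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    print(rouge_n_recall("the cat sat on the mat", "the cat was on the mat"))
```

In this example three of the five reference bigrams ("the cat", "on the", "the mat") also occur in the candidate, so the score is 3/5.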

folder_path=`pwd`
input_dir=${folder_path}/news/test/content-test.txt
reference_dir=${folder_path}/news/test/title-test.txt
summary_dir=${folder_path}/news/test/summary.txt

python predict.py $input_dir $reference_dir $summary_dir

Attention Visualization

To get the attention mask matrix, we need to modify the standard seq2seq ops in tf.nn.seq2seq. There is currently no method available to extract those tensors, so the source file itself has to be modified. The modified file is saved as seq2seq_attn.py in this package. See this blog post for details: http://www.deepnlp.org/blog/textsum-seq2seq-attention/

Run predict_attn.py

# Call the method in eval.py: plot_attention(data, X_label=None, Y_label=None), based on matplotlib package
# The attention heatmap will be saved under the /img folder
python predict_attn.py
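
The heatmap shows, for each generated headline word, how the attention weight is spread over the source words. As a text-only companion, the sketch below reads off the most-attended source word per output word. It assumes the matrix is laid out as one row per output token and one column per source token; the layout produced by seq2seq_attn.py may differ, and strongest_alignment is a hypothetical helper, not part of the package.

```python
def strongest_alignment(attention, source_tokens, output_tokens):
    """Return (output token, most-attended source token, weight) per output step.

    `attention` is assumed to be a list of rows, one row per output token,
    each row holding one weight per source token.
    """
    if len(attention) != len(output_tokens):
        raise ValueError("expected one attention row per output token")
    result = []
    for out_tok, row in zip(output_tokens, attention):
        if len(row) != len(source_tokens):
            raise ValueError("expected one weight per source token")
        best = max(range(len(row)), key=row.__getitem__)  # index of largest weight
        result.append((out_tok, source_tokens[best], row[best]))
    return result


if __name__ == "__main__":
    weights = [[0.7, 0.2, 0.1],
               [0.1, 0.3, 0.6]]
    for triple in strongest_alignment(weights, ["s1", "s2", "s3"], ["h1", "h2"]):
        print(triple)
```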


Training

Corpus format

Prepare four files: content-train.txt, title-train.txt, content-dev.txt, title-dev.txt. The corpus format is as follows:

content-train.txt

    TAG_NAME_EN       TAG_NAME_EN      TAG_DATE        TAG_NAME_EN         TAG_NAME_EN  TAG_NAME_EN 
        TAG_DATE TAG_NAME_EN                                          TAG_NAME_EN           
...

title-train.txt

...
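
Before training, it can help to check that a content file and its title file line up. The sketch below is a hypothetical helper (read_parallel_corpus is not part of the package) and assumes UTF-8 files in which line i of the title file is the headline for line i of the content file, with space-separated tokens.

```python
def read_parallel_corpus(content_path, title_path):
    """Return a list of (article_tokens, title_tokens) pairs, one per line.

    Raises ValueError if the two files have different numbers of lines.
    """
    with open(content_path, encoding="utf-8") as f:
        contents = [line.split() for line in f]
    with open(title_path, encoding="utf-8") as f:
        titles = [line.split() for line in f]
    if len(contents) != len(titles):
        raise ValueError(
            "line count mismatch: %d articles vs %d titles"
            % (len(contents), len(titles)))
    return list(zip(contents, titles))


if __name__ == "__main__":
    pairs = read_parallel_corpus("content-train.txt", "title-train.txt")
    print(len(pairs), "training pairs")
```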

Run the script

python headline.py

Reference

Tensorflow textsum examples on English Gigaword Corpus:
https://github.com/tensorflow/models/tree/master/textsum
Tensorflow seq2seq ops:
https://github.com/tensorflow/tensorflow/blob/64edd34ce69b4a8033af5d217cb8894105297d8a/tensorflow/contrib/legacy_seq2seq/python/ops/seq2seq.py
Blog:
http://www.deepnlp.org/blog/textsum-seq2seq-attention/
