seq2seq-textsum

Textsum automatic text summarization based on a Seq2Seq+Attention model


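This textsum example builds on the tensorflow (1.0.0) Seq2Seq-attention model (tf 0.12.0 corresponds to deepnlp 0.1.5). The Seq2Seq code is adapted from translate.py, and eval.py provides ROUGE and BLEU evaluation.

The corpus is the Sohu news set (SogouCS) from 2012, available from the Sogou lab at http://www.sogou.com/labs/resource/cs.php. Environment: 8-CPU Linux machine.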


Tutorial

This textsum example implements a Seq2Seq model in TensorFlow for automatic summarization. The code is modified from the original English-French translate.py model in the TensorFlow tutorial. ROUGE and BLEU evaluation methods are provided in the eval.py module. A pre-trained headline generation model for Chinese news articles is also distributed.

Corpus and Config

We use the Chinese news corpus from sohu.com. You can download it from http://www.sogou.com/labs/resource/cs.php

Seq2Seq-attention config params:

Encoder LSTM:

  • num_layers = 4 # 4 layer LSTM
  • size = 256 # 256 nodes per layer
  • num_samples = 4096 # 4096 samples for the sampled softmax
  • batch_size = 64 # 64 examples per batch
  • vocab_size = 50000 # top 50000 words in dictionary

Bucket: News articles are truncated to 120 words, and articles with fewer words are padded with 'PAD'. Titles are truncated to 30 words. buckets = [(120, 30), ...]
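The truncate-and-pad step described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names `fit_to_length` and `fit_pair` are hypothetical and not taken from the package, padding is placed on the right, and the special symbols (such as GO/EOS) or input reversal that the actual translate.py-derived code may apply are left out.

```python
PAD = "PAD"  # padding token named in the bucket description

# Bucket from the text: articles up to 120 tokens, titles up to 30 tokens.
BUCKETS = [(120, 30)]


def fit_to_length(tokens, length, pad=PAD):
    """Truncate to `length` tokens, then right-pad with `pad` (assumed side)."""
    clipped = tokens[:length]
    return clipped + [pad] * (length - len(clipped))


def fit_pair(content_tokens, title_tokens, bucket=BUCKETS[0]):
    """Fit one (article, title) pair into a single bucket (hypothetical helper)."""
    content_len, title_len = bucket
    return (fit_to_length(content_tokens, content_len),
            fit_to_length(title_tokens, title_len))


# A 150-token article is cut to 120; a 2-token title is padded to 30.
content, title = fit_pair(["w%d" % i for i in range(150)], ["t1", "t2"])
print(len(content), len(title))
```

Using one fixed shape per bucket lets every example in a batch share the same tensor dimensions, at the cost of some wasted computation on padding.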

Model File: The pre-trained model file is named 'headline_large.ckpt-48000.data-00000-of-00001'. It is compressed and split into 3 files: *.tar.gz00, *.tar.gz01, *.tar.gz02

#Model File Name: headline_large.ckpt-48000.data-00000-of-00001

cd ./ckpt
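# Join only the split parts (*.tar.gz00 to *.tar.gz02), so other checkpoint files and the output itself are not swept in
cat headline_large.ckpt-48000.*.tar.gz0[0-2] > headline_large.ckpt-48000.data-00000-of-00001.tar.gz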
tar xzvf headline_large.ckpt-48000.data-00000-of-00001.tar.gz
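For environments without cat and tar, the same reassembly can be sketched with the Python standard library. The function name `reassemble_and_extract` is hypothetical, and the sketch assumes the parts are named as the archive name plus a two-digit suffix, as the file list above suggests.

```python
import glob
import os
import shutil
import tarfile


def reassemble_and_extract(ckpt_dir, archive_name):
    """Join `<archive_name>NN` parts in order, then unpack the .tar.gz.

    Assumes two-digit numeric suffixes (00, 01, 02, ...).
    """
    pattern = os.path.join(glob.escape(ckpt_dir),
                           glob.escape(archive_name) + "[0-9][0-9]")
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError("no split parts found for " + archive_name)

    archive_path = os.path.join(ckpt_dir, archive_name)
    with open(archive_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # append each part in order

    # Only unpack archives obtained from a trusted source.
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(ckpt_dir)
    return parts


# Example call (paths follow the shell commands above):
# reassemble_and_extract(
#     "./ckpt", "headline_large.ckpt-48000.data-00000-of-00001.tar.gz")
```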

Prediction

Run the script

Run predict.py and interactively enter the Chinese news text (space separated). The automatically generated headline will be returned.

python predict.py

Output examples

news:        TAG_DATE TAG_NUMBER       TAG_NAME_EN  TAG_DATE TAG_NUMBER  TAG_DATE TAG_NUMBER  TAG_NAME_EN                                  
headline:            

news:            TAG_NAME_EN   TAG_DATE            TAG_NAME_EN   TAG_NUMBER TAG_NAME_EN TAG_NUMBER            TAG_NAME_EN  TAG_NUMBER      
headline:      TAG_NUMBER  
...

Evaluate ROUGE-score

Run the command: python predict.py arg1 arg2 arg3, where arg1 is the path to the space-separated Chinese news content corpus, one article per line; arg2 is the path to the human-written headlines for the news, one title per line; and arg3 is the path where the machine-generated summaries will be saved.

The script calls the evaluate(X, Y, method = "rouge_n", n = 2) method in eval.py.

folder_path=`pwd`
input_dir=${folder_path}/news/test/content-test.txt
reference_dir=${folder_path}/news/test/title-test.txt
summary_dir=${folder_path}/news/test/summary.txt

python predict.py $input_dir $reference_dir $summary_dir
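To make the rouge_n setting with n = 2 concrete, here is a minimal sketch of the textbook ROUGE-N recall on space-separated text. It is not the code in eval.py, which may compute a different variant (for example an F-score); the names `ngrams` and `rouge_n_recall` are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n=2):
    """Share of reference n-grams that also appear in the candidate.

    Both inputs are space-separated strings, matching the corpus format.
    Counts are clipped, so a repeated n-gram is matched at most as often
    as it occurs in the candidate.
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0  # reference shorter than n tokens
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / float(total)


# Reference bigrams: (a b), (b c), (c d); the candidate shares only (a b).
score = rouge_n_recall("a b d", "a b c d", n=2)
print(round(score, 3))
```

A corpus-level score would typically average this value over all lines of the summary and reference files.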

Attention Visualization

To get the attention mask matrix, we need to modify the standard seq2seq ops in tf.nn.seq2seq. There is currently no available method to extract those tensors, so the source file itself has to be modified. The modified file is saved as seq2seq_attn.py in this package. Please check out this blog for details: http://www.deepnlp.org/blog/textsum-seq2seq-attention/

Run predict_attn.py

# Call the method in eval.py: plot_attention(data, X_label=None, Y_label=None), based on matplotlib package
# The attention heatmap will be saved under the /img folder
python predict_attn.py

Examples: [image: attention heatmap]

Training

Corpus format

Prepare four files: content-train.txt, title-train.txt, content-dev.txt, title-dev.txt. The corpus format is as follows:

content-train.txt

    TAG_NAME_EN       TAG_NAME_EN      TAG_DATE        TAG_NAME_EN         TAG_NAME_EN  TAG_NAME_EN 
        TAG_DATE TAG_NAME_EN                                          TAG_NAME_EN           
...

title-train.txt

    
           ...
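Before training, it can help to confirm that the content and title files line up. The sketch below assumes that line i of the title file is the headline for line i of the content file and that both files are UTF-8 encoded; the function name `read_pairs` is hypothetical and not part of the package.

```python
import io


def read_pairs(content_path, title_path):
    """Return (content_tokens, title_tokens) for each aligned line pair.

    Raises ValueError when the two files differ in line count, which
    usually means the corpus is misaligned.
    """
    with io.open(content_path, encoding="utf-8") as f:
        contents = [line.rstrip("\n") for line in f]
    with io.open(title_path, encoding="utf-8") as f:
        titles = [line.rstrip("\n") for line in f]

    if len(contents) != len(titles):
        raise ValueError("line count mismatch: %d contents vs %d titles"
                         % (len(contents), len(titles)))

    # Tokens are space separated, as in the corpus samples above.
    return [(c.split(), t.split()) for c, t in zip(contents, titles)]


# Example call (file names from the list above):
# pairs = read_pairs("content-train.txt", "title-train.txt")
```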

Run the script

python headline.py

Reference