
Keyphrase Generation for Scientific Document Retrieval


This repository contains the code for reproducing the experiments from the paper:

  • Keyphrase Generation for Scientific Document Retrieval.
    Florian Boudin, Ygor Gallina, Akiko Aizawa.
    Association for Computational Linguistics (ACL), 2020.



Here, we use the NTCIR-2 ad-hoc monolingual (English) IR test collection. The test collection contains 322,058 documents, 49 search topics and relevance judgments.

|-- data
    |-- docs
        |-- ntc1.e1.gz  // NTCIR-1 (#187,080) collection converted with 
        |-- ntc2-e1g.gz // NTCIR-2 (#77,433) NACSIS Academic Conference Papers Database
        |-- ntc2-e1k.gz // NTCIR-2 (#57,545) NACSIS Grant-in-Aid Scientific Research Database
    |-- rels
        |-- rel1_ntc2-e2_0101-0149 // judgments for relevant documents 
        |-- rel2_ntc2-e2_0101-0149 // judgments for partially relevant documents 
    |-- topics
        |-- topic-e0101-0149 // English topics for NTCIR-2

Installing anserini

Here, we use Anserini, an open-source information retrieval toolkit built on Lucene. Below are the installation steps for macOS (tested on macOS 10.14), based on the Anserini Colab demo.

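# install Java (AdoptOpenJDK) and Maven
# (newer Homebrew releases use `brew install --cask` instead of `brew cask install`)
brew cask install adoptopenjdk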
brew install maven

# cloning / installing anserini
git clone --recurse-submodules
cd anserini/
# change the jacoco version from 0.8.2 to 0.8.3 in pom.xml so that the build succeeds
mvn clean package appassembler:assemble

# compile evaluation tools and other scripts
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..


Converting documents to TREC format

First, we convert the NTCIR SGML-formatted documents to TREC format for easier indexing. The NTCIR fields

    <ACCN>...</ACCN> // doc_id
    <TITE>...</TITE> or <PJNE>...</PJNE> // title
    <AUPE>...</AUPE> // authors
    <CNFE>...</CNFE> // conference name
    <CNFD>...</CNFD> // conference date
    <ABSE>           // abstract
        <ABSE.P>...</ABSE.P> // paragraph
    <KYWE>...</KYWE> // keywords
    <SOCE>...</SOCE> // host society

are mapped to the TREC fields

    <DOCNO>...</DOCNO>  // doc_id
    <TITLE>...</TITLE>  // title
    <TEXT>...</TEXT>    // abstract
    <HEAD>...</HEAD>    // keywords (optional)

by doing:

sh src/
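
The field mapping above can be sketched in Python as follows. This is a minimal illustration and not the repository's conversion script: the function names `get_field` and `ntcir_to_trec`, the `<REC>` wrapper in the sample record, and the `<DOC>` wrapper in the output are assumptions that do not come from the listings above.

```python
import re


def get_field(record, tag):
    """Return the text of the first <tag>...</tag> element, or '' if absent."""
    # Allow optional attributes on the opening tag, but do not match longer
    # tag names (e.g. <ABSE.P> when looking for <ABSE>).
    pattern = r"<{0}(?:\s[^>]*)?>(.*?)</{0}>".format(re.escape(tag))
    match = re.search(pattern, record, flags=re.DOTALL)
    if not match:
        return ""
    # Drop nested tags such as <ABSE.P> and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", match.group(1))
    return " ".join(text.split())


def ntcir_to_trec(record):
    """Map one NTCIR record (a string) to a TREC-style document string."""
    doc_id = get_field(record, "ACCN")
    # The title is stored either in <TITE> or in <PJNE>.
    title = get_field(record, "TITE") or get_field(record, "PJNE")
    abstract = get_field(record, "ABSE")
    keywords = get_field(record, "KYWE")

    lines = [
        "<DOC>",  # assumed wrapper, not shown in the listing above
        "<DOCNO>{}</DOCNO>".format(doc_id),
        "<TITLE>{}</TITLE>".format(title),
        "<TEXT>{}</TEXT>".format(abstract),
    ]
    if keywords:  # <HEAD> is optional
        lines.append("<HEAD>{}</HEAD>".format(keywords))
    lines.append("</DOC>")
    return "\n".join(lines)


# Hypothetical record, for illustration only.
sample = """<REC>
<ACCN>doc-1</ACCN>
<TITE>A title</TITE>
<ABSE><ABSE.P>First paragraph.</ABSE.P><ABSE.P>Second one.</ABSE.P></ABSE>
<KYWE>retrieval // indexing</KYWE>
</REC>"""
print(ntcir_to_trec(sample))
```

Fields that are not indexed (authors, conference name and date, host society) are simply ignored in this sketch.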

Some statistics about the generated data:

ntc1-e1: 187,080 documents, 185,061 with keywords
ntc2-e1g: 77,433 documents, 75,081 with keywords
ntc2-e1k: 57,545 documents, 57,443 with keywords

all: 322,058 documents, 317,585 with keywords (98.6%)
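
The totals can be checked directly from the per-collection counts listed above. The helper name `keyword_coverage` is illustrative only; the numbers are copied from the statistics.

```python
# Per-collection counts copied from the statistics above:
# name -> (documents, documents with keywords)
STATS = {
    "ntc1-e1": (187080, 185061),
    "ntc2-e1g": (77433, 75081),
    "ntc2-e1k": (57545, 57443),
}


def keyword_coverage(stats):
    """Return (total documents, total with keywords, percentage with keywords)."""
    total_docs = sum(docs for docs, _ in stats.values())
    total_kw = sum(kw for _, kw in stats.values())
    return total_docs, total_kw, 100.0 * total_kw / total_docs


total_docs, total_kw, pct = keyword_coverage(STATS)
print("all: {:,} documents, {:,} with keywords ({:.1f}%)".format(
    total_docs, total_kw, pct))
```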

Creating indexes

We are now ready for indexing!

sh src/


Converting topics to TREC format

Again, we have to convert the NTCIR topics to TREC format for easier retrieval. The NTCIR topic structure

<TOPIC q=0101> // topic number is an attribute here

    <TITLE>       // title part
    <DESCRIPTION> // sentence-length description
    <NARRATIVE>   // longer narrative
    <CONCEPT>     // concepts (?)
    <FIELD>       // fields (?)

is mapped to the TREC topic structure

    <num> Number: XXX
    <title> ...
    <desc> Description:
    <narr> Narrative:

by doing:

# create topic file with title / description / narrative
python3 src/ \
        --input data/topics/topic-e0101-0149 \
        --output data/topics/topic-e0101-0149.title+desc+narr.trec
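
The topic conversion can be sketched in Python as follows. This is a minimal illustration, not the repository's script: the function names `topic_field` and `ntcir_topic_to_trec`, the closing tags in the sample topic, and the `<top>` wrapper in the output are assumptions.

```python
import re


def topic_field(topic, tag):
    """Return the whitespace-normalized text of <tag>...</tag>, or ''."""
    match = re.search(r"<{0}>(.*?)</{0}>".format(tag), topic, flags=re.DOTALL)
    return " ".join(match.group(1).split()) if match else ""


def ntcir_topic_to_trec(topic):
    """Convert one NTCIR topic (a string) to a TREC-style topic string."""
    # The topic number is the q attribute of the opening <TOPIC> tag.
    number = re.search(r'<TOPIC\s+q="?(\d+)"?\s*>', topic)
    if number is None:
        raise ValueError("no <TOPIC q=...> tag found")
    return "\n".join([
        "<top>",  # assumed wrapper, not shown in the listing above
        "<num> Number: {}".format(number.group(1)),
        "<title> {}".format(topic_field(topic, "TITLE")),
        "<desc> Description:",
        topic_field(topic, "DESCRIPTION"),
        "<narr> Narrative:",
        topic_field(topic, "NARRATIVE"),
        "</top>",
    ])


# Hypothetical topic, for illustration only.
sample = """<TOPIC q=0101>
<TITLE>keyphrase generation</TITLE>
<DESCRIPTION>Papers on generating keyphrases.</DESCRIPTION>
<NARRATIVE>Relevant papers describe
keyphrase generation models.</NARRATIVE>
</TOPIC>"""
print(ntcir_topic_to_trec(sample))
```

The `<CONCEPT>` and `<FIELD>` parts are dropped here, since the target format only has number, title, description and narrative.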

Topics are categorized into fields:

  1. Electricity, information and control
  2. Chemistry
  3. Architecture, civil engineering and landscape gardening
  4. Biology and agriculture
  5. Science
  6. Engineering
  7. Medicine and dentistry
  8. Cultural and social science

Retrieving documents

We are now ready to retrieve!

sh src/ 

Note that the topic field used for retrieving documents defaults to title, according to the help message of Anserini's SearchCollection:

 -topicfield VAL             : Which field of the query should be used, default
                               "title". For TREC ad hoc topics, description or
                               narrative can be used. (default: title)


sh src/


Results for retrieval models using keyphrase generation are reported in the table below. Two initial indexing configurations are examined: title and abstract only (T+A), and title, abstract and author keywords (T+A+K).

T+A                      0.2916  0.3193  0.2898  0.3147
+s2s-copy-top5-all       0.3045  0.3356  0.3012  0.3233
+s2s-corr-top5-all       0.3010  0.3306  0.2941  0.3079
+multipartiterank-top5   0.2924  0.3227  0.2956  0.3269
T+A+K                    0.3138  0.3517  0.3063  0.3300
+s2s-copy-top5-all       0.3157  0.3652  0.3163  0.3367
+s2s-corr-top5-all       0.3137  0.3526  0.3101  0.3260
+multipartiterank-top5   0.3138  0.3518  0.3123  0.3347
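
To compare each keyphrase generation variant against its initial index, the relative change per score column can be computed as in the sketch below. The scores are copied from the table above in the same column order; the names `RESULTS` and `relative_change` are illustrative only.

```python
# Scores copied from the table above, keeping the column order as shown.
RESULTS = {
    "T+A": {
        "base": [0.2916, 0.3193, 0.2898, 0.3147],
        "+s2s-copy-top5-all": [0.3045, 0.3356, 0.3012, 0.3233],
        "+s2s-corr-top5-all": [0.3010, 0.3306, 0.2941, 0.3079],
        "+multipartiterank-top5": [0.2924, 0.3227, 0.2956, 0.3269],
    },
    "T+A+K": {
        "base": [0.3138, 0.3517, 0.3063, 0.3300],
        "+s2s-copy-top5-all": [0.3157, 0.3652, 0.3163, 0.3367],
        "+s2s-corr-top5-all": [0.3137, 0.3526, 0.3101, 0.3260],
        "+multipartiterank-top5": [0.3138, 0.3518, 0.3123, 0.3347],
    },
}


def relative_change(base, scores):
    """Percentage change of each score over the matching baseline score."""
    return [100.0 * (s - b) / b for b, s in zip(base, scores)]


for index, rows in RESULTS.items():
    for name, scores in rows.items():
        if name == "base":
            continue
        changes = relative_change(rows["base"], scores)
        print(index, name, " ".join("{:+.1f}%".format(c) for c in changes))
```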
