NGEC -- Next generation political event coder

This repository contains the code for the Next Generation Event Coder (NGEC), a Python library for extracting event data from news text. The pipeline works out of the box to code events using the PLOVER event ontology, but it can easily be customized to produce events with a custom ontology.

It accompanies the working paper, "Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks".

Overview

We break the problem of event extraction into six steps (a sketch of how the stages fit together follows the list):

  1. Event classification: identify the event described in a document (e.g., PROTEST, ASSAULT, AGREE, ...) using a transformer classifier trained on new data.
  2. Sub-event ("mode") classification: identify a more specific event type (e.g., PROTEST-riot, ASSAULT-aerial), also using a transformer-based classifier.
  3. Context classification: identify themes or topics in a document (e.g., "human rights", "environment") using a classifier.
  4. Event attribute identification: identify the spans of text that report who carried out the event, who it was directed against, where it occurred, etc. We do this with a fine-tuned question-answering model trained on newly annotated text.
  5. Actor, location, and date resolution: resolve extracted named actors and recipients to their Wikipedia pages using an offline Wikipedia index and a custom neural similarity model.
  6. Entity categorization: map each actor to its country and "sector" code as defined by the PLOVER ontology (e.g., "GOV", "MIL", etc.).
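
Taken together, the stages look roughly like the sketch below. This is an illustration only: the function and argument names are hypothetical placeholders, not the actual NGEC API.

# Hypothetical sketch of the six-stage pipeline; names are placeholders, not the NGEC API.
def code_story(story, event_clf, mode_clf, context_clf, qa_model, resolver, categorizer):
    text = story["event_text"]
    story["event_type"] = event_clf(text)                       # 1. event classification (e.g., PROTEST)
    story["event_mode"] = mode_clf(text, story["event_type"])   # 2. sub-event ("mode") classification
    story["contexts"] = context_clf(text)                       # 3. context/theme classification
    spans = qa_model(text, story["event_type"])                 # 4. QA model extracts actor/recipient/location spans
    entities = resolver(spans, story["date"])                   # 5. resolve spans to Wikipedia/Geonames entries and dates
    story["actors"] = categorizer(entities)                     # 6. map to country + PLOVER sector codes (GOV, MIL, ...)
    return story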

Note: this repository ships with basic pretrained models for event classification (step 1), but it does not currently include pretrained context or mode models (steps 2 and 3).

Running

The main script is ngec_process.py.

python ngec_process.py --help

usage: ngec_process.py [-h] [-m -1] [-a NGEC/assets/PROP-SQuAD-trained-tinybert-6l-768d-squad2220302-1457] [-b NGEC/assets/]
                       [-g ../mordecai3/mordecai_new.pt]
                       [input_file]

positional arguments:
  input_file            JSONL input file. At a minimum, this should have keys for "id", "date", and
                        "event_text"

options:
  -h, --help            show this help message and exit
  -m, --max-stories -1
                        Max stories to code. -1 is all stories
  -a, --attribute-dir NGEC/assets/PROP-SQuAD-trained-tinybert-6l-768d-squad2220302-1457
                        Location of the QA attribute model
  -b, --base-path NGEC/assets/
                        Location of the other models and files
  -g, --geo-model ../mordecai3/mordecai_new.pt
                        Location of the geolocation model
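
A minimal input record needs only the three required keys. The record and filename below are invented for illustration:

# stories.jsonl -- one JSON object per line ("id", "date", and "event_text" are required)
{"id": "example-001", "date": "2019-08-01", "event_text": "Protesters gathered outside parliament to demand the minister's resignation."}

# code at most 100 stories from that file
python ngec_process.py -m 100 stories.jsonl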

An example output record looks like this:

{
  "id": "20190801-2227-8b13212ac6f6",
  "date": "2019-08-01",
  "event_type": [
    "ACCUSE",
    "REJECT",
    "THREATEN",
    "SANCTION"
  ],
  "event_mode": [],
  "event_text": "The Liberal Party, the largest opposition in Paraguay, .... ",
  "story_id": "EFESP00020190801ef8100001:50066618",
  "publisher": "translateme2-pt",
  "headline": "\nOposição confirma q...",
  "pub_date": "2019-08-01",
  "contexts": [
    "corruption"
  ],
  "version": "NGEC_coder-Vers001-b1-Run-001"
}

Quick start

First, create a new Conda environment and install the required libraries:

conda create -y --name ngec python=3.10
conda activate ngec

pip install spacy textacy sentence-transformers
python -m spacy download en_core_web_trf
pip install elasticsearch elasticsearch_dsl unidecode dateparser
pip install jsonlines tqdm datasets rich plac 
pip install mordecai3
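
To confirm that the spaCy transformer model downloaded correctly, you can try loading it (this just loads the model and exits):

python -c "import spacy; spacy.load('en_core_web_trf')"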

Next, set up an Elasticsearch server with offline Wikipedia and Geonames indexes. Download the pre-built index and start an Elasticsearch instance with it (the commands below assume you have Docker installed).

# Download a pre-built index from my website:
wget https://andrewhalterman.com/files/geonames_wiki_index_2023-03-02.tar.gz
# uncompress it to produce a directory called `geonames_index` (note that this includes both geonames *and* Wiki)
tar -xvzf geonames_wiki_index_2023-03-02.tar.gz
# You may need to set write permissions for Docker to run
# chmod -R 777 ./geonames_index/
# Then start an Elasticsearch instance in Docker with the uncompressed index as a volume.
# Later versions of Elasticsearch have not been tested.
sudo docker run -d -p 127.0.0.1:9200:9200 -e "discovery.type=single-node" -v ./geonames_index/:/usr/share/elasticsearch/data elasticsearch:7.10.1
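
Once the container is running, a quick sanity check is to list the indices; you should see the Geonames and Wikipedia indices in the output:

curl -s 'localhost:9200/_cat/indices?v'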

If you want to build these indices from scratch, see the detailed instructions for creating an offline Wikipedia index and setting up offline Geonames in Elasticsearch.

Note on the models

Because of conditions imposed by our funder and the proprietary data used in the project, we cannot share the training data or the trained event, mode, and context models used to produce the POLECAT event dataset. However, we provide example code for training classifiers on your own data in the setup directory. We also provide demonstration pretrained models for the event categories used in the POLECAT dataset, trained on a corpus of pseudo-labeled synthetic news stories using an approach described in Halterman (2023). These classifiers are not as accurate as the ones used in the POLECAT dataset, but they work reasonably well and could easily be improved with additional training data.
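
As a rough illustration of what training a custom classifier can look like, here is a simplified stand-in (not the code in the setup directory, and not the approach used for the POLECAT models): it embeds example texts with sentence-transformers and fits a scikit-learn classifier. The texts, labels, and embedding model are invented for the example.

# Minimal sketch: embed labeled examples and fit a simple classifier.
# Texts, labels, and model choice are placeholders, not the NGEC training setup.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = [
    "Police fired tear gas at demonstrators outside the ministry.",
    "The president accused the opposition of sabotaging the talks.",
]
labels = ["PROTEST", "ACCUSE"]  # PLOVER-style event categories

encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# classify a new story
print(clf.predict(encoder.encode(["Rioters clashed with security forces in the capital."])))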

Citing

The steps that this pipeline implements are described in more detail in the paper. If you use the pipeline or the techniques we introduce, please cite the following:

@article{halterman_et_al2023creating,
  title={Creating Custom Event Data Without Dictionaries: A Bag-of-Tricks},
  author={Andrew Halterman and Philip A. Schrodt and Andreas Beger and Benjamin E. Bagozzi and Grace I. Scarborough},
  journal={arXiv preprint arXiv:2304.01331},
  year={2023}
}

Acknowledgements

This research was sponsored by the Political Instability Task Force (PITF). The PITF is funded by the Central Intelligence Agency. The views expressed in this paper are the authors’ alone and do not represent the views of the U.S. Government.

Related Projects