Convert Wikidata and Wikipedia raw files to filterable formats, with a focus on marking Wikidata entities as summaries based on their Wikipedia abstracts.
CC-BY-4.0 License
This project focuses on the pre-processing steps required for the Wiki Entity Summarization (Wiki ES) project. It involves building the necessary databases and loading data from various sources to prepare for the entity summarization tasks.
For the pre-processing steps, we used an r5a.4xlarge instance on AWS.
To get started with the pre-processing, follow these steps:
pip install wikimapper
If you would like to download the latest version, run the following:
EN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path}
wikimapper download enwiki-latest --dir $EN_WIKI_REDIRECT_AND_PAGES_PATH
After enwiki-{VERSION}-page.sql.gz, enwiki-{VERSION}-redirect.sql.gz, and enwiki-{VERSION}-page_props.sql.gz are in place under your data directory, run the following commands:
VERSION={VERSION}
EN_WIKI_REDIRECT_AND_PAGES_PATH={your_files_path}
INDEX_DB_PATH="`pwd`/data/index_enwiki-$VERSION.db"
wikimapper create enwiki-$VERSION --dumpdir $EN_WIKI_REDIRECT_AND_PAGES_PATH --target $INDEX_DB_PATH
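Before migrating the index to Postgres, it can help to sanity-check a few title-to-QID lookups in the SQLite index. The sketch below runs against an in-memory stand-in rather than the real file; the `mapping` table name and its `wikipedia_id`/`wikipedia_title`/`wikidata_id` columns reflect wikimapper's layout as we understand it, so verify them against your own index_enwiki-{VERSION}.db before relying on this.

```python
import sqlite3

# Tiny in-memory stand-in for the wikimapper index DB.
# Assumption: wikimapper stores its data in a "mapping" table with
# wikipedia_id, wikipedia_title, and wikidata_id columns; confirm
# with `.schema` in the sqlite3 shell on your real index file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping (wikipedia_id INTEGER, wikipedia_title TEXT, wikidata_id TEXT)"
)
conn.execute("INSERT INTO mapping VALUES (9228, 'Douglas_Adams', 'Q42')")


def title_to_wikidata_id(conn, title):
    """Look up the Wikidata QID for a Wikipedia page title."""
    row = conn.execute(
        "SELECT wikidata_id FROM mapping WHERE wikipedia_title = ?", (title,)
    ).fetchone()
    return row[0] if row else None


print(title_to_wikidata_id(conn, "Douglas_Adams"))  # -> Q42
```

If lookups like this return the expected QIDs, the index was built correctly and is ready for the pgloader migration below.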
./config-files-generator.sh
source .env
cat <<EOT > sqlite-to-page-migration.load
load database
from $INDEX_DB_PATH
into postgresql://$DB_USER:$DB_PASS@$DB_HOST:$DB_PORT/$DB_NAME
with include drop, create tables, create indexes, reset sequences
;
EOT
pgloader ./sqlite-to-page-migration.load
python3 missing_data_correction.py
The pre-processing steps involve loading data from the following sources:
Wikidata Graph Builder (wdgp):
docker-compose up wdgp
Wikipedia Page Extractor (wppe):
docker-compose up wppe
When both datasets are loaded into the databases, we process all available pages in the Wikipedia dataset to extract the abstract and infobox of the corresponding Wikidata entity. These pages are then marked based on the extracted data, and the edges incident to marked pages are marked as candidates. Since Wikidata is a heterogeneous graph with multiple edge types, we need to pick the most relevant edge between two entities as the summary for the summarization task. This module is called Wiki Summary Annotator (wsa), and we use DistilBERT to filter the most relevant edge.
docker-compose up wsa
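Conceptually, the annotator's edge-selection step can be sketched as follows. This toy version scores candidate edges by lexical overlap with the entity's abstract instead of the DistilBERT model the wsa module actually uses, and the edge tuples and names are purely illustrative:

```python
def select_summary_edge(abstract, candidate_edges):
    """Pick the candidate edge whose predicate and object labels overlap
    most with the entity's abstract (a toy stand-in for DistilBERT scoring)."""
    abstract_tokens = set(abstract.lower().split())

    def score(edge):
        # edge is (subject_qid, predicate_label, object_label)
        _, predicate, obj = edge
        edge_tokens = set(predicate.lower().split()) | set(obj.lower().split())
        return len(abstract_tokens & edge_tokens)

    return max(candidate_edges, key=score)


abstract = (
    "Douglas Adams was an English author best known as the writer "
    "of The Hitchhiker's Guide to the Galaxy."
)
edges = [
    ("Q42", "place of birth", "Cambridge"),
    ("Q42", "occupation", "English author"),
    ("Q42", "educated at", "St John's College"),
]
print(select_summary_edge(abstract, edges))  # -> ('Q42', 'occupation', 'English author')
```

The real annotator replaces the overlap score with a learned relevance score, but the overall shape is the same: rank every candidate edge between two entities against the abstract and keep the best one.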
By running the above commands, you will have the necessary databases and data loaded to start the Wiki Entity Summarization project. The next steps involve providing a set of seed nodes based on your preference along with other configuration parameters to get a fully customized Entity Summarization Dataset.
If you use this project in your research, please cite the following paper:
@misc{javadi2024wiki,
title = {Wiki Entity Summarization Benchmark},
author = {Saeedeh Javadi and Atefeh Moradan and Mohammad Sorkhpar and Klim Zaporojets and Davide Mottin and Ira Assent},
year = {2024},
eprint = {2406.08435},
archivePrefix = {arXiv},
primaryClass = {cs.IR}
}
This project is licensed under the CC BY 4.0 License. See the LICENSE file for details.