Large-scale biomedical relation extraction with entity and pair embeddings
This repository contains source code to learn dense semantic representations for biomedical entities and pairs of entities as used in Sänger and Leser: "Large-scale Entity Representation Learning for Biomedical Relationship Extraction" (Bioinformatics, 2020).
The approach aims to perform biomedical relation extraction at the corpus level, based on entity and entity pair embeddings learned on the complete PubMed corpus. For this, we focus on all articles mentioning a certain biomedical entity (e.g. mutation V600E) or pair of entities within the article title or abstract. We concatenate all articles mentioning the entity or entity pair and apply paragraph vectors (Le and Mikolov, 2014) to learn an embedding for each distinct entity or pair of entities.
Content: Usage | Pre-trained Entity Embeddings | Embedding Training | Supported Entity Types | Citation | Acknowledgements |
The implementation of the embeddings is based on Gensim. The following snippet highlights the basic use of the pre-trained embeddings.
from gensim.models import KeyedVectors
# Loading pre-trained entity model
model = KeyedVectors.load("mutation-v0500.bin")
# Print number of distinct entities of the model
# (Gensim < 4.0; on Gensim >= 4.0 use len(model.key_to_index) instead)
print(f"Distinct entities: {len(model.vocab)}\n")
# Get the embedding for a specific entity
entity_embedding = model["rs113488022"]
print(f"Embedding of rs113488022:\n{entity_embedding}\n")
# Find similar entities
print("Most similar entities to rs113488022:")
top5_nearest_neighbors = model.most_similar("rs113488022", topn=5)
for i, (entity_id, sim) in enumerate(top5_nearest_neighbors):
    print(f" {i+1}: {entity_id} (similarity: {sim:.3f})")
This should output:
Distinct entities: 47498
Embedding of rs113488022:
[ 1.15715809e-01 4.90018785e-01 -6.05004542e-02 -8.35603476e-02
9.20398310e-02 -1.51171118e-01 4.01901715e-02 -2.36775234e-01
...
]
Most similar entities to rs113488022:
1: rs121913227 (similarity: 0.690)
2: rs121913364 (similarity: 0.628)
3: rs121913529 (similarity: 0.610)
4: rs121913357 (similarity: 0.573)
5: rs11554290 (similarity: 0.571)
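The similarity scores reported by `most_similar` are cosine similarities between embedding vectors. The following is a minimal, dependency-free sketch of that ranking, intended only to illustrate how the scores above can be read. The helper names `cosine_similarity` and `rank_neighbors`, the toy identifiers, and the 3-dimensional vectors are assumptions made for this example and are not part of the repository or of Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def rank_neighbors(query_id, embeddings, topn=5):
    """Return up to topn (entity_id, similarity) pairs, most similar first.

    The query entity itself is excluded from the result.
    """
    query = embeddings[query_id]
    scored = [
        (entity_id, cosine_similarity(query, vector))
        for entity_id, vector in embeddings.items()
        if entity_id != query_id
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Toy vectors, made up for illustration only (real embeddings are much larger)
toy_embeddings = {
    "entity_a": [1.0, 0.0, 0.0],
    "entity_b": [0.9, 0.1, 0.0],
    "entity_c": [0.0, 1.0, 0.0],
}

neighbors = rank_neighbors("entity_a", toy_embeddings, topn=2)
for rank, (entity_id, sim) in enumerate(neighbors, start=1):
    print(f" {rank}: {entity_id} (similarity: {sim:.3f})")
```

A score close to 1 means the two entities occur in similar textual contexts, while a score near 0 indicates unrelated contexts.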
To compute entity and entity pair embeddings, we use the complete PubMed corpus together with the data and entity annotations provided by PubTator Central.
python download_resources.py --resources pubtator_central
Note: The annotation data requires > 70GB of disk space.
Learning entity embeddings can be done in two steps:
python prepare_entity_dataset.py --working_dir _out --entity_type mutation
We support entity types cell line, chemical, disease, drug, gene, mutation, and species.
python learn_embeddings.py --input_file _out/mutation/doc2vec_input.txt \
--config_file ../resources/configurations/doc2vec-0500.config \
--model_name mutation-v0500 \
--output_dir _out/mutation
Example configurations can be found in resources/configurations.
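Conceptually, each entity is represented by one long document: the concatenation of all articles that mention it. The sketch below illustrates this grouping on toy data. It is not the code of prepare_entity_dataset.py; the function `build_entity_documents`, the input layout, the article texts, and the identifier `disease_x` are assumptions chosen for illustration.

```python
from collections import defaultdict


def build_entity_documents(articles, annotations):
    """Concatenate the texts of all articles mentioning each entity.

    articles:    dict mapping article id -> text (e.g. title plus abstract)
    annotations: iterable of (article_id, entity_id) mention pairs
    Returns a dict mapping entity id -> concatenated article text.
    """
    article_ids_by_entity = defaultdict(list)
    seen = set()
    for article_id, entity_id in annotations:
        # Skip repeated mentions within one article and unknown articles
        if (article_id, entity_id) in seen or article_id not in articles:
            continue
        seen.add((article_id, entity_id))
        article_ids_by_entity[entity_id].append(article_id)

    return {
        entity_id: " ".join(articles[article_id] for article_id in article_ids)
        for entity_id, article_ids in article_ids_by_entity.items()
    }


# Toy data, made up for illustration only
articles = {
    "a1": "BRAF V600E in melanoma",
    "a2": "V600E predicts response",
    "a3": "an unrelated article",
}
annotations = [
    ("a1", "rs113488022"),
    ("a2", "rs113488022"),
    ("a1", "rs113488022"),  # duplicate mention in the same article
    ("a1", "disease_x"),    # hypothetical disease identifier
]

entity_documents = build_entity_documents(articles, annotations)
for entity_id, text in entity_documents.items():
    print(f"{entity_id}: {text}")
```

Each resulting document, tagged with its entity identifier, would then serve as one training instance for the paragraph vectors model.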
To learn entity pair embeddings, preparation of the entity annotations has to be performed first (see above). Analogously to the entity embeddings, learning of pair embeddings is performed in two steps:
python prepare_pair_dataset.py --working_dir _out --source_type mutation --target_type disease
We support entity types disease, drug, and mutation.
python learn_embeddings.py --input_file _out/mutation-disease/doc2vec_input.txt \
--config_file ../resources/configurations/doc2vec-0500.config \
--model_name mutation-disease-v0500 \
--output_dir _out/mutation-disease
Example configurations can be found in resources/configurations.
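For entity pairs, the same idea applies to articles that mention both entities of a pair. The following sketch shows one way such co-occurrence documents could be assembled. It is a simplified illustration under stated assumptions, not the logic of prepare_pair_dataset.py: the function `build_pair_documents`, the annotation tuple layout, and the identifier `disease_x` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def build_pair_documents(articles, annotations, source_type, target_type):
    """Concatenate the texts of all articles mentioning both entities of a pair.

    articles:    dict mapping article id -> text (e.g. title plus abstract)
    annotations: iterable of (article_id, entity_type, entity_id) mentions
    Returns a dict mapping (source_id, target_id) -> concatenated article text.
    """
    sources = defaultdict(set)
    targets = defaultdict(set)
    for article_id, entity_type, entity_id in annotations:
        if article_id not in articles:
            continue
        if entity_type == source_type:
            sources[article_id].add(entity_id)
        elif entity_type == target_type:
            targets[article_id].add(entity_id)

    article_ids_by_pair = defaultdict(list)
    for article_id in articles:
        # Every source/target combination found in this article co-occurs there
        source_ids = sorted(sources.get(article_id, ()))
        target_ids = sorted(targets.get(article_id, ()))
        for pair in product(source_ids, target_ids):
            article_ids_by_pair[pair].append(article_id)

    return {
        pair: " ".join(articles[article_id] for article_id in article_ids)
        for pair, article_ids in article_ids_by_pair.items()
    }


# Toy data, made up for illustration only
articles = {
    "a1": "V600E is frequent in this disease",
    "a2": "V600E and disease outcome",
    "a3": "V600E without any disease mention",
}
annotations = [
    ("a1", "mutation", "rs113488022"),
    ("a1", "disease", "disease_x"),  # hypothetical disease identifier
    ("a2", "mutation", "rs113488022"),
    ("a2", "disease", "disease_x"),
    ("a3", "mutation", "rs113488022"),
]

pair_documents = build_pair_documents(articles, annotations, "mutation", "disease")
for (source_id, target_id), text in pair_documents.items():
    print(f"{source_id} / {target_id}: {text}")
```

Article `a3` contributes to no pair document because it mentions only the mutation.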
Entity Type | Identifier | Example |
---|---|---|
Cell line | Cellosaurus ID | CVCL:0027 (Hep-G2) |
Chemical | MeSH | MESH:D000068878 (hTrastuzumab) |
Disease | MeSH | MESH:D006984 (hypertrophic chondrocytes) |
Disease | Disease Ontology ID (DOID) 1 | DOID:60155 (visual agnosia) |
Drug | Drugbank ID | DB00166 (lipoic acid) |
Gene | NCBI Gene ID | NCBI:673 (BRAF) |
Mutation | RS-Identifier | rs113488022 (V600E) |
Species | NCBI Taxonomy | TAXON:9606 (human) |
1: Use option "--entity_type disease-doid" when calling prepare_entity_dataset.py to normalize disease annotations to the Disease Ontology.
Please use the following bibtex entry to cite our work:
@article{saenger2020entityrepresentation,
title={Large-scale Entity Representation Learning for Biomedical Relationship Extraction},
author={S{\"a}nger, Mario and Leser, Ulf},
journal={Bioinformatics},
year={2020},
publisher={Oxford University Press}
}
We use the annotations from PubTator Central to compute the entity embeddings. For further details, refer to:
Wei, Chih-Hsuan, et al. "PubTator central: automated concept annotation for biomedical full text articles." Nucleic acids research 47.W1 (2019): W587-W593.
We use information from the Disease Ontology to normalize disease annotations. For further details, refer to:
Schriml, Lynn M., et al. "Human Disease Ontology 2018 update: classification, content and workflow expansion." Nucleic acids research 47.D1 (2019): D955-D962.
We use the paragraph vectors model to perform entity representation learning. For further details, refer to:
Le, Quoc, and Tomas Mikolov. "Distributed representations of sentences and documents." International conference on machine learning. 2014.