
Automatically generate differential reaction fingerprints on reactions in Rhea



Rhea Differential Reaction Fingerprints for Enzyme Classification Prediction

This repository generates differential reaction fingerprints for reactions in Rhea.


The SMILES dataframe and DRFP-derived fingerprint dataframe can be loaded from GitHub with:

import pandas as pd

base_url = "https://github.com/cthoyt/rhea-fingerprints/raw/main/docs"
smiles_url = f"{base_url}/127/reaction_smiles.tsv"
smiles_df = pd.read_csv(smiles_url, sep="\t")

fingerprint_url = f"{base_url}/127/reaction_fingerprints.tsv.gz"
fingerprint_df = pd.read_csv(fingerprint_url, sep="\t", index_col=0)
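
The URLs above embed the Rhea release number (127) in the path. As a minimal sketch, a small helper can build both URLs for a given release. The function name `rhea_artifact_urls` is hypothetical, and it assumes that other releases follow the same directory layout as 127, which this README does not confirm.

```python
BASE_URL = "https://github.com/cthoyt/rhea-fingerprints/raw/main/docs"


def rhea_artifact_urls(version, base_url=BASE_URL):
    """Build the download URLs for one Rhea release directory.

    Hypothetical helper: assumes every release uses the same file names
    as the 127 release shown above.
    """
    root = f"{base_url.rstrip('/')}/{version}"
    return {
        "smiles": f"{root}/reaction_smiles.tsv",
        "fingerprints": f"{root}/reaction_fingerprints.tsv.gz",
    }


urls = rhea_artifact_urls("127")
print(urls["smiles"])
print(urls["fingerprints"])
```

Either URL can then be passed to `pd.read_csv` exactly as in the snippet above.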

[Figure: 2D PCA scatterplot of the embeddings]
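
For readers who want to compute a comparable 2D projection themselves, here is a minimal sketch of PCA using a NumPy singular value decomposition. It is not necessarily how the plot in this repository was produced. The function name `pca_2d` is hypothetical, and random bit vectors stand in for the real fingerprints so the example runs offline.

```python
import numpy as np


def pca_2d(matrix):
    """Project the rows of `matrix` onto their first two principal components."""
    # Center each column so the components describe variance around the mean.
    centered = matrix - matrix.mean(axis=0)
    # Rows of `vt` are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Synthetic stand-in for the fingerprint matrix: 50 "reactions", 64 bits each.
rng = np.random.default_rng(0)
fake_fps = rng.integers(0, 2, size=(50, 64)).astype(float)

coords = pca_2d(fake_fps)
print(coords.shape)
```

With the real data, something like `pca_2d(fingerprint_df.to_numpy(dtype=float))` should work, assuming every remaining column of `fingerprint_df` is a numeric fingerprint bit.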


This repository also generates reusable models for predicting Enzyme Commission (EC) codes from DRFPs, trained on Rhea. The models are simple classifiers (for example, logistic regression, as used below) and perform well despite their simplicity.

You can reuse the existing models together with drfp like this:

import pystow
from drfp import DrfpEncoder

base_url = "https://github.com/cthoyt/rhea-fingerprints/raw/main/docs"
url = f"{base_url}/127/models/LogisticRegression.pkl"
clf = pystow.ensure_pickle("bio", "rhea", "models", "127", url=url)

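rxn_smiles = [
    # Placeholder example (esterification); replace with your own reaction SMILES.
    "CC(=O)O.OCC>>CC(=O)OCC.O",
]
fps = DrfpEncoder.encode(rxn_smiles)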

predictions = clf.predict(fps)

Warning: There may be issues with reloading the model weights. Please let me know if this comes up.
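
One way to surface such reload problems early is to wrap unpickling in a guard that reports the failure instead of crashing. This is only a sketch: the helper name `load_model_or_none` is hypothetical, the list of exception types is an assumption (the README does not say how reloading fails), and a plain dictionary stands in for a real model so the example runs without any downloads.

```python
import pickle
import tempfile
from pathlib import Path


def load_model_or_none(path):
    """Try to unpickle a model; return None and report the error on failure."""
    try:
        with Path(path).open("rb") as file:
            return pickle.load(file)
    except (pickle.UnpicklingError, AttributeError, ImportError, EOFError, ValueError) as exc:
        # Assumed failure modes: corrupt files or classes missing at load time.
        print(f"Could not reload {path}: {exc!r}")
        return None


with tempfile.TemporaryDirectory() as directory:
    good_path = Path(directory) / "good.pkl"
    good_path.write_bytes(pickle.dumps({"stand_in": "model"}))

    bad_path = Path(directory) / "bad.pkl"
    bad_path.write_bytes(b"not a pickle")

    good_result = load_model_or_none(good_path)
    bad_result = load_model_or_none(bad_path)

print(good_result)
print(bad_result)
```

If the guard returns `None` for a downloaded model, the printed exception is useful detail to include when reporting the problem.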


Installation of the requirements and running of the build script are handled with tox. The current version of Rhea is looked up with bioversions so the provenance of the data can be properly traced. Run with:

$ pip install tox
$ tox

Additionally, a GitHub Action runs this update script on a monthly basis.


Code in this repository is licensed under the MIT License. Parts of the Rhea database are redistributed under the CC-BY-4.0 license.


If you find this useful in your own work, please consider citing:

  author       = {Charles Tapley Hoyt},
  title        = {Rhea Differential Reaction Fingerprints for Enzyme Classification Prediction},
  month        = jan,
  year         = 2023,
  publisher    = {Zenodo},
  version      = {v124},
  doi          = {10.5281/zenodo.7591839},
  url          = {https://doi.org/10.5281/zenodo.7591839}

I also gave a talk on this in case you want to read up more.


Rhea can be cited with:

    author = {Lombardot, Thierry and Morgat, Anne and Axelsen, Kristian B and Aimo, Lucila and Hyka-Nouspikel, Nevila and Niknejad, Anne and Ignatchenko, Alex and Xenarios, Ioannis and Coudert, Elisabeth and Redaschi, Nicole and Bridge, Alan},
    doi = {10.1093/nar/gky876},
    journal = {Nucleic acids research},
    number = {D1},
    pages = {D596--D600},
    pmid = {30272209},
    title = {{Updates in Rhea: SPARQLing biochemical reaction data.}},
    volume = {47},
    year = {2019}

Differential reaction fingerprints can be cited with:

    abstract = {Differential Reaction Fingerprint DRFP is a chemical reaction fingerprint enabling simple machine learning models running on standard hardware to reach DFT- and deep learning-based accuracies in reaction yield prediction and reaction classification.},
    author = {Probst, Daniel and Schwaller, Philippe and Reymond, Jean-Louis},
    doi = {10.1039/D1DD00006C},
    issn = {2635-098X},
    journal = {Digital Discovery},
    title = {{Reaction classification and yield prediction using the differential reaction fingerprint DRFP}},
    url = {http://xlink.rsc.org/?DOI=D1DD00006C},
    year = {2022}