A GROBID module for extracting and structuring medical reports into XML/TEI encoded documents
grobid-medical-report is a GROBID module for extracting and restructuring medical reports from raw documents (PDF, text) into encoded documents (XML/TEI). All models built in this module are machine learning models based on Wapiti CRF, as GROBID's default models are. Deep learning models developed with DeLFT can also be used in GROBID as an alternative to Wapiti CRF.
grobid-medical-report is a module of GROBID, so GROBID must be installed first.
First, clone the latest version of GROBID. We can clone either the forked project or the original GROBID repository (which needs slight adjustments to use the grobid-medical-report module).
1. Clone from the forked project:
$ git clone https://github.com/tantikristanti/grobid.git
$ cd grobid
$ git checkout grobid-medical-report
OR
2. Clone from the original GROBID repository:
$ git clone https://github.com/kermitt2/grobid.git
$ cd grobid
$ git checkout -b [NEW_BRANCH]
Install and build GROBID:
$ ./gradlew clean install
To install and build GROBID behind a proxy, add the proxy host and port:
$ ./gradlew -DproxySet=true -DproxyHost=[proxy_host] -DproxyPort=[proxy_port] clean install
Make sure that the current working directory is grobid:
$ pwd
--> grobid
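Because the module is cloned inside the GROBID checkout, the directory check above can be scripted before cloning. The following is a minimal sketch; the function name `require_cwd_named` is hypothetical and not part of GROBID, and it assumes the checkout directory kept its default name `grobid`.

```shell
#!/bin/sh
# Hypothetical helper (not part of GROBID): succeed only if the last
# component of the current working directory equals the expected name.
require_cwd_named() {
    expected="$1"
    actual="$(basename "$(pwd)")"
    if [ "$actual" != "$expected" ]; then
        echo "expected to be in '$expected', but current directory is '$actual'" >&2
        return 1
    fi
}

# Usage: run the check, and clone the module only if it passes.
#   require_cwd_named grobid && git clone <module repository URL>
```

If GROBID was cloned into a directory with a different name, pass that name instead of `grobid`.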
Clone grobid-medical-report from this repository:
$ git clone https://github.com/tantikristanti/grobid-medical-report.git
$ cd grobid-medical-report
$ pwd
--> grobid/grobid-medical-report
Install and build grobid-medical-report:
$ ./gradlew clean install
To install and build grobid-medical-report behind a proxy, add the proxy host and port:
$ ./gradlew -DproxySet=true -DproxyHost=[proxy_host] -DproxyPort=[proxy_port] clean install
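The same proxy flags are needed for both builds (GROBID and grobid-medical-report), so they can be produced by one small helper. This is only a sketch: the function name `gradle_proxy_flags` and the environment variables `PROXY_HOST` and `PROXY_PORT` are assumptions made for illustration, not something defined by GROBID or Gradle.

```shell
#!/bin/sh
# Hypothetical helper: print the Gradle proxy flags shown above when both
# PROXY_HOST and PROXY_PORT are set, and print nothing otherwise.
# The variable names PROXY_HOST and PROXY_PORT are assumptions of this sketch.
gradle_proxy_flags() {
    if [ -n "${PROXY_HOST:-}" ] && [ -n "${PROXY_PORT:-}" ]; then
        printf '%s\n' "-DproxySet=true -DproxyHost=${PROXY_HOST} -DproxyPort=${PROXY_PORT}"
    fi
}

# Usage (from grobid/ or grobid/grobid-medical-report/); the substitution is
# intentionally left unquoted so that the flags split into separate arguments:
#   ./gradlew $(gradle_proxy_flags) clean install
```

When neither variable is set, the helper prints nothing and the command reduces to the plain `./gradlew clean install`.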
Following GROBID, grobid-medical-report also builds its models with a waterfall (cascade) approach. We prepare 11 sequence labeling models to parse medical documents at different levels of the document's hierarchical structure.
Using these models, we can process medical documents (in this case, in French) with the following steps:
medical-report-segmenter model to segment the input document (PDF) into the header, the body, notes (headnote, footnote, left-note, right-note), and page sections.
header-medical-report model for extracting information concerning patients, medical personnel, and documents (e.g., document number, document type, and date) found in the header section.
full-medical-text model.
French-medical-NER model for recognizing medical terminologies found mainly in the body part of the document.
Each of these models can be retrained using additional data. A more detailed explanation of how to retrain and evaluate the models can be found in Train and evaluate the models.
A web-based front-end is provided so that end users can try the available methods and models interactively, in addition to the batch commands. To start the service, run the following command:
$ ./gradlew run
The service can be accessed on port 8090 (http://localhost:8090/).
http://localhost:8090/api/isalive
>>> The service will return whether it's up (true) or not (false)
http://localhost:8090/api/version
http://localhost:8090/api/grobidMedicalReport
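The endpoints above can also be queried from the command line. The sketch below assumes that `curl` is installed and that the `isalive` and `version` endpoints answer plain GET requests; the helper names `api_url` and `api_get` are hypothetical. The parameters of `grobidMedicalReport` are not described here, so that endpoint is left out.

```shell
#!/bin/sh
# Base URL of the running service (the address given above).
BASE_URL="http://localhost:8090"

# Hypothetical helper: build the URL of an API endpoint such as
# "isalive" or "version", dropping one trailing slash from BASE_URL.
api_url() {
    printf '%s/api/%s\n' "${BASE_URL%/}" "$1"
}

# Hypothetical helper: query an endpoint with curl.
# -f makes curl fail on HTTP errors; -sS hides the progress meter
# but still prints error messages.
api_get() {
    curl -fsS "$(api_url "$1")"
}

# Usage, once the service has been started with ./gradlew run:
#   api_get isalive
#   api_get version
```

Only the helper definitions run when the block is sourced; no request is sent until `api_get` is called.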
More detailed explanations of the API services provided by grobid-medical-report can be found in API services.
This repository was originally prepared for a collaborative project between INRIA and APHP. The original datasets and models contain genuine sensitive data and cannot be shared publicly.