grobid

A machine learning software for extracting information from scholarly documents

APACHE-2.0 License

Downloads
627
Stars
3K
Committers
52

Bot releases are hidden (Show)

grobid - Version 0.8.0 Latest Release

Published by kermitt2 11 months ago

Added

  • Extraction of funder and funding information with a specific new model, see https://github.com/kermitt2/grobid/pull/1046 for details
  • Optional consolidation of funder with CrossRef Funder Registry
  • Identification of acknowledged entities in the acknowledgement section
  • Optional coordinates in title elements

Changed

  • Dropwizard upgrade to 4.0
  • Minimum JDK/JVM requirement for building/running the project is now 1.11
  • Logging now with Logback, removal of Log4j2, optional logs in json format
  • General review of logs
  • Enable Github actions / Disable circleci

Fixed

  • Set dynamic memory limit in pdfalto_server #1038
  • Logging in files when training models work now as expected
  • Various dependency upgrades
  • Fix #1051 with possible problematic PDF
  • Fix #1036 for pdfalto memory limit
  • fix readthedocs build #1040
  • fix for null equation #1030
  • Other minor fixes
grobid - Version 0.7.3

Published by kermitt2 over 1 year ago

Added

  • Support for JDK beyond 1.11, tested up to Java 17, thanks to removal of dynamic native library loading after the start of the JVM
  • Incremental training (all models and ML engines), add this option in training command line and training web service (#971)
  • Systematic benchmarking on two new sets: PLOS (1000 artilces) and eLife (984 articles)
  • All end-to-end evaluation datasets are now available from the same place: https://zenodo.org/record/7708580
  • Option to output coordinates in notes and figure/table captions
  • Support for Mac ARM architecture (#975)
  • Play With Docker documentation (#962)

Changed

  • Update to DeLFT version 0.3.3
  • Demo now hosted as HuggingFace space
  • Additional training data, in particular for citation, reference-segmenter, segmentation, header, etc.
  • Update Deep Learning models (and some of the CRF)
  • The standard analyzer for sub-lexical tokenization is available in grobid-core, and used for the citation model (in particular for improving CJK references) (#990)
  • Update evaluations

Fixed

  • Correct wrong content type in doc for processCitation web service
  • Sentence segmentation applied to notes (#995)
  • Other minor fixes
grobid - Version 0.7.2

Published by kermitt2 almost 2 years ago

Added

  • Explicit identification of data/code availability statements (#951) and funding statements (#959), including when they are located in the header
  • Link footnote and their "callout" marker in full text (#944)
  • Option to consolidate header only with DOI if a DOI is extracted (#742)
  • "Window" application of RNN model for reference-segmenter to cover long bibliographical sections
  • Add dynamic timeout on pdfalto_server (#926)
  • A modest Python script to help to find "interesting" error cases in a repo of JATS/PDF pairs, grobid-home/scripts/select_error_cases.py

Changed

  • Update to DeLFT version 0.3.2
  • Some more training data (authors in reference, segmentation, citation, reference-segmenter) (including #961, #864)
  • Update of some models, RNN with feature channels and CRF (segmentation, header, reference-segmenter, citation)
  • Review guidelines for segmentation model
  • Better URL matching, using in particular PDF URL annotation in account

Fixed

  • Fix unexpected figure and table labeling in short texts
  • When matching an ORCID to an author, prioritize Crossref info over extracted ORCID from the PDF (#838)
  • Annotation errors for acknowledgement and other minor stuff
  • Fix for Python library loading on Mac
  • Update docker file to support new CUDA key
  • Do not dehyphenize text in superscript or subscript
  • Allow absolute temporary paths
  • Fix redirected stderr from pdfalto not "gobbled" by the java ProcessBuilder call (#923)
  • Other minor fixes
grobid - Version 0.7.1

Published by kermitt2 over 2 years ago

Added

Changed

  • Update to DeLFT version to 0.3.1 and TensorFlow 2.7, with many improvements, see https://github.com/kermitt2/delft/releases/tag/v0.3.0
  • Update of Deep Learning models
  • Update of JEP and add install script
  • Update to new biblio-glutton version 0.2, for improved and faster bibliographical reference matching
  • circleci to replace Travis
  • Update of processFulltextAssetDocument service to use the same parameters as processFulltextDocument
  • Pre-compile regex if not already done
  • Review features for header model

Fixed

  • Improved date normalization (#760)
  • Fix possible issue with coordinates related to reference markers (#908) and sentence (#811)
  • Fix path to bitmap/vector graphics (#836)
  • Fix possible catastrophic regex backtracking (#867)
  • Other minor fixes
grobid - Version 0.7.0

Published by kermitt2 over 3 years ago

Added

  • New YAML configuration: all the settings are in one single yaml file, each model can be fully configured independently
  • Improvement of the segmentation and header models (for header, +1 F1-score for PMC evaluation, +4 F1-score for bioRxiv), improvements for body and citations
  • Add figure and table pop-up visualization on PDF in the console demo
  • Add PDF MD5 digest in the TEI results (service only)
  • Language support packages and xpdfrc file for pdfalto (support of CJK and exotic fonts)
  • Prometheus metrics
  • BidLSTM-CRF-FEATURES implementation available for more models
  • Addition of a "How GROBID works" page in the documentation

Changed

  • JitPack release (RIP jcenter)
  • Improved DOI cleaning
  • Speed improvement (around +10%), by factorizing some layout token manipulation
  • Update CrossRef requests implementation to align to the current usage of CrossRef's X-Rate-Limit-Limit response parameter

Fixed

  • Fix base url in demo console
  • Add missing pdfalto Graphics information when -noImage is used, fix graphics data path in TEI
  • Fix the tendency to merge tables when they are in close proximity
grobid - Version 0.6.2

Published by kermitt2 over 3 years ago

Added

  • Docker image covering both Deep Learning and CRF models, with GPU detection and preloading of embeddings
  • For Deep Learning models, labeling is now done by batch: application of the citation DL model is 4 times faster for BidLSTM-CRF (with or without features) and 6 times faster for SciBERT
  • More tests for sentence segmentation
  • Add orcid of persons when available from the PDF or via consolidation (i.e. if in CrossRef metadata)
  • Add BidLSTM-CRF-FEATURES header model (with feature channel)
  • Add bioRxiv end-to-end evaluation
  • Bounding boxes for optional section titles coordinates

Changed

  • Reduce the size of docker images
  • Improve end-to-end evaluation: multithreaded processing of PDF, progress bar, output the evaluation report in markdown format
  • Update of several models covering CRF, BidLSTM-CRF and BidLSTM-CRF-FEATURES, mainly improving citation and author recognitions
  • OpenNLP is the default optional sentence segmenter (similar result as Pragmatic Segmenter for scholar documents after benchmarking, but 30 times faster)
  • Refine sentence segmentation to exploit layout information and predicted reference callouts
  • Update jep version to 3.9.1

Fixed

  • Ignore invalid utf-8 sequences
  • Update CrossRef multithreaded calls to avoid using the unreliable time interval returned by the CrossRef REST API service, update usage of Crossref-Plus-API-Token and update the deprecated crossref field query.title
  • Missing last table or figure when generating training data for the fulltext model
  • Fix an error related to the feature value for the reference callout for the fulltext model
  • Review/correct DeLFT configuration documentation, with a step-by-step configuration documentation
  • Other minor fixes
grobid - Version 0.6.1

Published by kermitt2 about 4 years ago

Added

  • Support of line number (typically in preprints)
  • End-to-end evaluation and benchmark for preprints using the bioRxiv 10k dataset
  • Check whether PDF annotation is orcid and add orcid to author in the TEI result
  • Configuration for making sequence labeling engine (CRF Wapiti or Deep Learning) specific to models
  • Add a developers guide and a FAQ section in the documentation
  • Visualization of formulas on PDF layout in the demo console
  • Feature for subscript/superscript style in fulltext model

Changed

  • New significantly improved header model: with new features, new training data (600 new annotated examples, old training data is entirely removed), new labels and updated data structures in line with the other models
  • Update of the segmentation models with more training data
  • Removal of heuristics related to the header
  • Update to gradle 6.5.1 to support JDK 13 and 14
  • TEI schemas
  • Windows is not supported in this release

Fixed

  • Preserve affiliations after consolidation of the authors
  • Environment variable config override for all properties
  • Unfrequent duplication of the abstract in the TEI result
  • Incorrect merging of affiliations
  • Noisy parentheses in the bibliographical reference markers
  • In the console demo, fix the output filename wrongly taken from the input form when the text form is used
  • Synchronisation of the language detection singleton initialisation in case of multithread environment
  • Other minor fixes
grobid - Version 0.6.0

Published by kermitt2 over 4 years ago

Added

  • Table content structuring (thanks to @Vitaliy-1), see PR #546
  • Support for application/x-bibtex at /api/processReferences and /api/processCitation (thanks to @koppor)
  • Optionally include raw affiliation string in the TEI result
  • Add dummy model for facilitating test in Grobid modules
  • Allow environment variables for config properties values to ease Docker config
  • ChangeLog

Changed

  • Improve CORS configuration #527 (thank you @lfoppiano)
  • Documentation improvements
  • Update of segmentation and fulltext model and training data
  • Better handling of affiliation block fragments
  • Improved DOI string recognition
  • More robust n-fold cross validation (case of shared grobid-home)
grobid - Version 0.5.6

Published by kermitt2 about 5 years ago

  • Better abstract structuring (with citation contexts)
  • n-fold cross evaluation and better evaluation report (thanks to @lfoppiano)
  • Improved PMC ID and PMID recognition
  • Improved subscript/superscript and font style recognition (via pdfalto)
  • Improved JEP integration (support of python virtual environment for using DeLFT Deep Learning library, thanks @de-code and @lfoppiano)
  • Several bug fixes (thanks @de-code, @bnewbold, @Vitaliy-1 and @lfoppiano)
  • Improved dehyphenization (thanks to @lfoppiano)
grobid - Version 0.5.5

Published by kermitt2 over 5 years ago

  • Using pdfalto instead of pdf2xml for the first PDF parsing stage, with many improvements in robustness, ICU support, unknown glyph/font normalization
  • Improvement and full review of the integration of consolidation services, supporting biblio-glutton (additional identifiers and Open Access links) and Crossref REST API (add specific user agent, email and token for Crossref Metadata Plus)
  • Fix bounding box issues for some PDF #330
  • Updated lexicon #396
grobid - Version 0.5.4

Published by kermitt2 over 5 years ago

Changes:

  • transparent usage of DeLFT deep learning models (usual BidLSTM-CRF) instead of Wapiti CRF models, native integration via JEP

  • support of biblio-glutton as DOI/metadata matching service, alternative to crossref REST API

  • improvement of citation context identification and matching (+9% recall with similar precision, for PMC sample 1943 articles, from 43.35 correct citation contexts per article to 49.98 correct citation contexts per article)

  • citation callout now in abstract, figure and table captions

  • structured abstract (including update of TEI schema)

  • bug fixes and some more parameters: by default using all available threads when training and possibility to load models at the start of the service

grobid - Version 0.5.3

Published by kermitt2 almost 6 years ago

Changes:

  • Improvement of consolidation options and processing (better handling of CrossRef API, but the best is coming soon ;)
  • Better recall for figure and table identification (thanks to @detonator413)
  • Support of proxy for calling crossref with Apache HttpClient
  • Minor bugfixing
grobid - 0.5.2

Published by lfoppiano about 6 years ago

Changes:

  • Corrected back status codes from the REST API when no available engine (503 is back again to inform the client to wait, it was removed by error in version 0.5.0 and 0.5.1 for PDF processing services only, see documentation of the REST API)
  • Added metrics in the REST entrypoint (accessible via http://localhost:8071)
  • Added Grobid clients for Java, Python and NodeJS
  • Added counters for consolidation tasks and consolidation results
  • Add case sensitiveness option in lexicon/FastMatcher
  • Updated documentation
  • Bugfixing: #339, #322, #300, and other
grobid - Version 0.5.1 of GROBID

Published by lfoppiano over 6 years ago

Bug fixes

grobid - Version 0.5.0 of GROBID

Published by kermitt2 almost 7 years ago

The latest stable release of GROBID is version 0.5.0. As compared to previous version 0.4.3, this version brings:

  • Migrate from maven to gradle for faster, more flexible and more stable build, release, etc.
  • Usage of Dropwizard for web services
  • Move the Grobid service manual to readthedocs
  • (thanks to @detonator413 and @lfoppiano for this release! future work in versions 0.5.* will focus again on improving PDF parsing and structuring accuracy)
grobid - Version 0.4.4 of GROBID

Published by lfoppiano about 7 years ago

Fixed issue that was making the release build not working

grobid - Version 0.4.3 of GROBID

Published by kermitt2 about 7 years ago

The latest stable release of GROBID is version 0.4.3. As compared to previous version 0.4.2, this version brings:

  • New models: f-score improvement on the PubMed Central sample, bibliographical references +2.5%, header +7%
  • New training data and features for bibliographical references, in particular for covering HEP domain (INSPIRE), arXiv identifier, DOI and url (thanks @iorala and @michamos !)
  • Support for CrossRef REST API (instead of the slow OpenURL-style API which requires a CrossRef account), in particular for multithreading usage (thanks @Vi-dot)
  • Improve training data generation and documentation (thanks @jfix)
  • Unicode normalisation and more robust body extraction (thanks @aoboturov)
  • fixes, tests, documentation and update of the pdf2xml fork for Windows (thanks @lfoppiano)
grobid - Version 0.4.2 of GROBID

Published by kermitt2 about 7 years ago

Versions 0.4.2 of GROBID

grobid - Versions 0.4.1 of GROBID

Published by kermitt2 about 8 years ago

grobid - Latest stable version for versions 0.3.* of GROBID

Published by kermitt2 almost 9 years ago

Latest stable version for versions 0.3.* of GROBID