C++ library and Python wrapper for dealing with Page XML files
MIT License
Check py-pagexml/README.rst and/or docker/Dockerfile_build, docker/Dockerfile_runtime.
Document structure detection from PAGE-XML to METS-XML
Python package for managing OHDSI clinical data models. Includes support for LLM-based plain text...
Single API for reading, manipulating and writing data in csv, ods, xls, xlsx and xlsm files
Combine LaTeX docs into a single PDF
ODF backend for AsciiDoc
OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched
A simple previewer for various markup formats.
A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) ...
Transforms PDF, Documents and Images into Enriched Structured Data
Improved file parsing for LLMs
Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
Creation and manipulation of Open XML documents (mainly docx).
Parse vision is an open-source tool to visualise what OCR is parsing in a PDF document to help de...
Excel Integration with spaCy. Training NER using Excel/XLSX from PDF, DOCX, PPT, PNG or JPG.
Python module that makes working with XML feel like you are working with JSON