pdftext

Extract structured text from PDFs quickly

APACHE-2.0 License

Downloads

22.8K

Stars

317


Ecosystems: Python

pdftext - Better device coordinate extraction

Published by VikParuchuri 10 days ago

There were some cases where visual and text coordinates didn't align. This fixes that issue.

pdftext - Revert extraction changes

Published by VikParuchuri 11 days ago

pdftext - Python 3.13 compatibility

Published by VikParuchuri 11 days ago

pdftext - Ignore special chars, break lines more aggressively

Published by VikParuchuri 11 days ago

pdftext - Fix flattening bug

Published by VikParuchuri 20 days ago

pdftext - Fix document loading bug

Published by VikParuchuri 20 days ago

There was a bug where PDF paths were assumed to be strings, which is not always the case.

pdftext - ONNX model, option to flatten form fields

Published by VikParuchuri 20 days ago

Faster inference with ONNX
Remove warning when loading scikit-learn model
Flatten form fields into pdf

pdftext - Fix bbox bug

Published by VikParuchuri 5 months ago

Fixed a bug where bboxes were not unnormalized properly.
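As a rough illustration of what unnormalizing a bbox involves, here is a minimal sketch. It assumes bboxes are stored as fractions of the page size in the range 0 to 1, in [x0, y0, x1, y1] order; the release note does not state the normalization scheme pdftext uses, and `unnormalize_bbox` is a hypothetical helper, not part of the library.

```python
def unnormalize_bbox(bbox, page_width, page_height):
    """Scale a bbox given as page fractions back to page coordinates.

    Assumption: bbox is [x0, y0, x1, y1] with every value in [0, 1].
    """
    x0, y0, x1, y1 = bbox
    return [
        x0 * page_width,   # left edge
        y0 * page_height,  # top edge
        x1 * page_width,   # right edge
        y1 * page_height,  # bottom edge
    ]
```

A common way for this step to go wrong is scaling the y values by the width, or the x values by the height, which only shows up on non-square pages.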

pdftext - Minor performance optimizations (Latest Release)

Published by VikParuchuri 5 months ago

Optimize dictionary access and loops for a ~10% speedup

pdftext - Add optional parallel workers

Published by VikParuchuri 5 months ago

Enable optional parallel workers when extracting text. This can slow down extraction on small PDFs, but can speed it up 2x or more on larger ones. Set it with the --workers flag in the CLI, or with the workers kwarg.
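The trade-off described here can be sketched with the standard library. This is not pdftext's implementation, which the release note does not describe: `extract_page` and `extract_pages` are hypothetical stand-ins, and a thread pool is used only to keep the example self-contained. CPU-bound extraction would more likely need a process pool.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_page(page_index):
    # Hypothetical stand-in for the real per-page extraction work.
    return f"text of page {page_index}"


def extract_pages(page_indices, workers=None):
    """Extract pages serially, or in parallel when workers > 1."""
    if not workers or workers <= 1:
        # Serial path: no pool startup cost, which suits small documents.
        return [extract_page(i) for i in page_indices]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so page order is kept.
        return list(pool.map(extract_page, page_indices))
```

The pool has a fixed startup and coordination cost, which is why parallelism tends to pay off only once there are enough pages to amortize it.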

pdftext - Fix font issue

Published by VikParuchuri 6 months ago

Previously, not all spans had the right font information. This fixes the issue.

pdftext - Work around pdfium bug

Published by VikParuchuri 6 months ago

Work around a pdfium bug where the charbox has zero width/height when loose=True with rotation

pdftext - Fix font names

Published by VikParuchuri 6 months ago

Fix logic for pulling font names
Increase sample frequency

pdftext - Change line breaks

Published by VikParuchuri 6 months ago

Use line breaks from pdfium

pdftext - Add option to keep individual characters

Published by VikParuchuri 6 months ago

Option to keep characters with JSON/dictionary output
Fix some bugs when interfacing with the pdfium C API (thanks @mara004)

pdftext - Reduce block false positives

Published by VikParuchuri 6 months ago

Add probability threshold for block predictions
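A probability threshold of this kind can be sketched as follows. The function name, the input shape (one block-start probability per candidate position), and the default of 0.9 are all assumptions made for illustration; the release note does not give the model's output format or the threshold value.

```python
def block_breaks(block_probs, threshold=0.9):
    """Return the positions where a new block is predicted.

    Assumption: block_probs holds one probability per candidate position.
    A position counts as a block start only if its probability reaches
    the threshold, so raising the threshold trades recall for fewer
    false positives.
    """
    return [i for i, prob in enumerate(block_probs) if prob >= threshold]
```

With probabilities [0.2, 0.95, 0.6, 0.99], a threshold of 0.9 keeps positions 1 and 3, while a threshold of 0.5 would also accept position 2.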

pdftext - Minor refactor; select page range

Published by VikParuchuri 6 months ago

Select a range of pages versus converting the whole doc
Minor internal refactor to use docs versus paths
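Page-range selection might look roughly like the sketch below. It assumes 0-indexed page numbers and a list-valued range; `select_pages` is a hypothetical helper, and the release note does not say how pdftext validates or indexes the range.

```python
def select_pages(page_count, page_range=None):
    """Return the page indices to convert.

    Assumption: pages are 0-indexed. None means the whole document.
    """
    if page_range is None:
        return list(range(page_count))
    invalid = [p for p in page_range if not 0 <= p < page_count]
    if invalid:
        raise ValueError(f"pages out of range: {invalid}")
    return list(page_range)
```

Validating up front means a bad page number fails before any extraction work starts.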

pdftext - Improve output format, hyphen handling

Published by VikParuchuri 6 months ago

Fix bug where hyphens didn't show up at the end of lines
Improve wrapping for hyphens: join words split by a hyphen before a newline (disable by passing keep_hyphens)
Restructure output to avoid redundant info in the JSON blob: keep track of text spans with similar font info instead of individual characters
Update model to predict blocks more accurately
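The hyphen joining mentioned in this release can be illustrated with a small sketch. This is a simplified approximation, not the library's code: `join_lines` is a hypothetical helper, and it treats every line-final hyphen as a soft break, so a genuinely hyphenated compound split at the hyphen would lose its hyphen.

```python
def join_lines(lines, keep_hyphens=False):
    """Join lines of text, merging words split by a line-final hyphen.

    With keep_hyphens=True the hyphen and the line break are preserved.
    """
    out = ""
    for line in lines:
        if out.endswith("-") and not keep_hyphens:
            # Drop the trailing hyphen and continue the word directly.
            out = out[:-1] + line
        elif out:
            out += "\n" + line
        else:
            out = line
    return out
```

For the lines "an exam-", "ple line", "next", the default gives "an example line" followed by "next" on a new line, while keep_hyphens=True leaves all three lines as they were.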

pdftext - Improve character bboxes

Published by VikParuchuri 6 months ago

Switch the character box to a loose box, to get the full character range

pdftext - Handle rotations

Published by VikParuchuri 6 months ago

Rotate bboxes if pdf is rotated
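Rotating a bbox along with the page can be sketched as below. The conventions are assumptions chosen for the example: a top-left origin with y increasing downward, clockwise rotation in multiples of 90 degrees, and width and height referring to the unrotated page. pdfium and the PDF format have their own conventions, and `rotate_point` and `rotate_bbox` are hypothetical helpers.

```python
def rotate_point(x, y, width, height, rotation):
    """Map a point on the unrotated page to the page rotated clockwise."""
    if rotation == 0:
        return x, y
    if rotation == 90:
        return height - y, x
    if rotation == 180:
        return width - x, height - y
    if rotation == 270:
        return y, width - x
    raise ValueError("rotation must be 0, 90, 180, or 270")


def rotate_bbox(bbox, width, height, rotation):
    """Rotate [x0, y0, x1, y1] and re-sort the corners into min/max order."""
    x0, y0, x1, y1 = bbox
    ax, ay = rotate_point(x0, y0, width, height, rotation)
    bx, by = rotate_point(x1, y1, width, height, rotation)
    return [min(ax, bx), min(ay, by), max(ax, bx), max(ay, by)]
```

Re-sorting the corners matters because a rotation can turn the original top-left corner into, say, the top-right one, which would otherwise produce a box with negative width.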

Package Rankings

Top 36.03% on Pypi.org

Related Projects

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipula...

06 Oct 2012 4,980

pdfje

🌷 Write beautiful PDFs in declarative Python

confectionary

a tool to quickly create sweet PDF files from text files

pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of ...

06 Jan 2012 7,337

pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) ...

08 Jul 2016 2,209

pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

29 Aug 2014 5,843

marker

Convert PDF to markdown quickly with high accuracy

30 Oct 2023 15,511

open-parse

Improved file parsing for LLM’s

22 Mar 2024 2,405

pdfrw

pdfrw is a pure Python library that reads and writes PDFs

30 May 2015 1,837

textdistance

📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interfac...

05 May 2017 3,302

pdf.tocgen

A CLI toolset to generate table of contents for PDF files automatically.

28 Jul 2020 647

latexpages

Combine LaTeX docs into a single PDF

surya

OCR, layout analysis, reading order, line detection in 90+ languages

10 Jan 2024 6,739

doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

28 Aug 2016 1,267

pypdftk

Python module to drive the awesome pdftk binary.

14 Mar 2013 145