pdftext

Extract structured text from PDFs quickly

Apache-2.0 License

Downloads

22.8K

Stars

317

Ecosystems: Python

pdftext - Better device coordinate extraction

Published by VikParuchuri 10 days ago

There were some cases where visual and text coordinates didn't align. This fixes that issue.

pdftext - Revert extraction changes

Published by VikParuchuri 10 days ago

pdftext - Python 3.13 compatibility

Published by VikParuchuri 11 days ago

pdftext - Ignore special chars, break lines more aggressively

Published by VikParuchuri 11 days ago

pdftext - Fix flattening bug

Published by VikParuchuri 20 days ago

pdftext - Fix document loading bug

Published by VikParuchuri 20 days ago

Fixed a bug where PDF paths were assumed to be strings, which is not always the case.

pdftext - ONNX model, option to flatten form fields

Published by VikParuchuri 20 days ago

Faster inference with ONNX
Remove warning when loading scikit-learn model
Flatten form fields into pdf

pdftext - Fix bbox bug

Published by VikParuchuri 5 months ago

Fixed a bug where bboxes were not unnormalized properly.
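
As a rough illustration of what unnormalizing a bbox involves, here is a minimal sketch. It assumes boxes are stored as [x0, y0, x1, y1] fractions of the page size; the release note does not say how pdftext normalizes them internally, and `unnormalize_bbox` is a hypothetical helper, not part of the library.

```python
from typing import List


def unnormalize_bbox(bbox: List[float], page_width: float, page_height: float) -> List[float]:
    """Scale a [x0, y0, x1, y1] box from the 0..1 range back to page units.

    Assumption: x values were divided by page width and y values by page height.
    """
    x0, y0, x1, y1 = bbox
    return [x0 * page_width, y0 * page_height, x1 * page_width, y1 * page_height]


# A box covering the middle half of the width and the bottom half of a US Letter page.
print(unnormalize_bbox([0.25, 0.5, 0.75, 1.0], 612, 792))
# -> [153.0, 396.0, 459.0, 792.0]
```

A typical failure mode is scaling both axes by the same dimension, which only goes unnoticed on square pages.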

pdftext - Minor performance optimizations (Latest Release)

Published by VikParuchuri 5 months ago

Optimize dictionary access and loops to get a ~10% speedup

pdftext - Add optional parallel workers

Published by VikParuchuri 5 months ago

Enable optional parallel workers when extracting text. This can cause a performance hit on small PDFs, but can speed things up 2x or more on larger ones. Enable it with the --workers flag on the CLI, or with the workers kwarg.
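
The trade-off described above (overhead on small documents, gains on large ones) follows from how page-level parallelism is usually structured. The sketch below is not pdftext's implementation; `chunk_pages`, `extract_all`, and the `extract_page` callback are hypothetical names. It uses a thread pool to stay short and runnable anywhere; CPU-bound extraction in Python would normally need a process pool to see a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def chunk_pages(page_count: int, workers: int) -> List[List[int]]:
    """Split page indices 0..page_count-1 into up to `workers` contiguous chunks."""
    workers = max(1, min(workers, page_count))
    size, extra = divmod(page_count, workers)
    chunks, start = [], 0
    for i in range(workers):
        # The first `extra` chunks take one additional page each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(list(range(start, end)))
        start = end
    return chunks


def extract_all(page_count: int,
                extract_page: Callable[[int], str],
                workers: Optional[int] = None) -> List[str]:
    """Run `extract_page` over every page, optionally in parallel chunks."""
    if workers is None or workers <= 1:
        # Serial path: no pool start-up cost, which is why small documents prefer it.
        return [extract_page(i) for i in range(page_count)]

    def run(chunk: List[int]) -> List[str]:
        return [extract_page(i) for i in chunk]

    chunks = chunk_pages(page_count, workers)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() yields results in chunk order, so page order is preserved.
        return [text for chunk in pool.map(run, chunks) for text in chunk]


print(chunk_pages(10, 3))
# -> [[0, 1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(extract_all(5, lambda i: f"page {i}", workers=2))
# -> ['page 0', 'page 1', 'page 2', 'page 3', 'page 4']
```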

pdftext - Fix font issue

Published by VikParuchuri 6 months ago

Previously, not all spans had the correct font information. This fixes the issue.

pdftext - Work around pdfium bug

Published by VikParuchuri 6 months ago

Charbox has zero width/height when loose=True with rotation

pdftext - Fix font names

Published by VikParuchuri 6 months ago

Fix logic for pulling font names
Increase sample frequency

pdftext - Change line breaks

Published by VikParuchuri 6 months ago

Use line breaks from pdfium

pdftext - Add option to keep individual characters

Published by VikParuchuri 6 months ago

Option to keep characters with JSON/dictionary output
Fix some bugs when interfacing with the pdfium C API (thanks @mara004)
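
To make the keep-characters option concrete, here is a hedged sketch of grouping characters into spans by font while optionally retaining the per-character records. The field names (`char`, `font_name`, `font_size`, `chars`) and the function `chars_to_spans` are illustrative assumptions and may not match pdftext's actual output schema.

```python
from typing import Any, Dict, List


def chars_to_spans(chars: List[Dict[str, Any]], keep_chars: bool = False) -> List[Dict[str, Any]]:
    """Merge consecutive characters that share a font into one span."""
    spans: List[Dict[str, Any]] = []
    for ch in chars:
        font = (ch["font_name"], ch["font_size"])
        if spans and spans[-1]["font"] == font:
            span = spans[-1]
            span["text"] += ch["char"]
        else:
            span = {"font": font, "text": ch["char"]}
            if keep_chars:
                span["chars"] = []
            spans.append(span)
        if keep_chars:
            # Only pay the size cost of per-character data when asked to.
            span["chars"].append(ch)
    return spans


chars = [
    {"char": "H", "font_name": "Times", "font_size": 12},
    {"char": "i", "font_name": "Times", "font_size": 12},
    {"char": "!", "font_name": "Times-Bold", "font_size": 12},
]
print([s["text"] for s in chars_to_spans(chars)])
# -> ['Hi', '!']
```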

pdftext - Reduce block false positives

Published by VikParuchuri 6 months ago

Add probability threshold for block predictions
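
A probability threshold of this kind can be sketched as follows. This is an illustration only: it assumes the model emits, for each line, a probability that the line starts a new block. The function name and the 0.9 default are hypothetical, not values taken from pdftext.

```python
from typing import List


def group_lines_into_blocks(lines: List[str],
                            break_probs: List[float],
                            threshold: float = 0.9) -> List[List[str]]:
    """Start a new block only when the predicted break probability clears the threshold."""
    blocks: List[List[str]] = []
    for line, prob in zip(lines, break_probs):
        if not blocks or prob >= threshold:
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return blocks


lines = ["a", "b", "c", "d"]
probs = [1.0, 0.4, 0.95, 0.6]
print(group_lines_into_blocks(lines, probs, threshold=0.5))
# -> [['a', 'b'], ['c'], ['d']]
print(group_lines_into_blocks(lines, probs, threshold=0.9))
# -> [['a', 'b'], ['c', 'd']]
```

Raising the threshold suppresses the low-confidence break before "d", which is the false-positive reduction the release title refers to.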

pdftext - Minor refactor; select page range

Published by VikParuchuri 6 months ago

Select a range of pages versus converting the whole doc
Minor internal refactor to use docs versus paths
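
The release note does not say how a page range is written, so the following sketch assumes a comma-separated string with inclusive dash ranges such as "0,5-7,20". `parse_page_range` is a hypothetical helper, not a pdftext function.

```python
from typing import List


def parse_page_range(spec: str) -> List[int]:
    """Expand a spec like "0,5-7,20" into a sorted list of unique page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = part.split("-", 1)
            # Ranges are treated as inclusive on both ends (an assumption).
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-7,20"))
# -> [0, 5, 6, 7, 20]
```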

pdftext - Improve output format, hyphen handling

Published by VikParuchuri 6 months ago

Fix bug where hyphens didn't show up at the end of lines
Improve wrapping for hyphens: join words across hyphens before the newline (disable by passing keep_hyphens)
Restructure output to avoid redundant info in the JSON blob: keep track of text spans with similar font info instead of individual characters
Update model to predict blocks more accurately
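
The hyphen handling described in this release can be illustrated with a small sketch. It is a simplification under stated assumptions, not pdftext's code: a line ending in "-" is treated as a word split across lines, and `join_hyphenated` is a hypothetical name. Only the `keep_hyphens` option name comes from the release note.

```python
from typing import List


def join_hyphenated(lines: List[str], keep_hyphens: bool = False) -> str:
    """Join a word split by a line-ending hyphen with its continuation on the next line."""
    merged: List[str] = []
    for line in lines:
        if merged and merged[-1].endswith("-") and not keep_hyphens:
            # Drop the trailing hyphen and glue the next line on.
            merged[-1] = merged[-1][:-1] + line
        else:
            merged.append(line)
    return "\n".join(merged)


print(join_hyphenated(["The extrac-", "tion works"]))
# -> The extraction works
print(repr(join_hyphenated(["The extrac-", "tion works"], keep_hyphens=True)))
# -> 'The extrac-\ntion works'
```

One limitation of this simple rule is that genuine compounds broken at the hyphen (for example "well-" followed by "known") lose their hyphen, which is one reason an opt-out flag is useful.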

pdftext - Improve character bboxes

Published by VikParuchuri 6 months ago

Switch the character box to a loose box, to get the full character range

pdftext - Handle rotations

Published by VikParuchuri 6 months ago

Rotate bboxes if the PDF is rotated
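
A minimal sketch of rotating a bbox along with its page is shown below. It assumes a top-left origin, clockwise rotation in 90-degree steps, and boxes given as [x0, y0, x1, y1] in the unrotated page's coordinates; pdftext and pdfium may use different conventions, and both function names are hypothetical.

```python
from typing import List, Tuple


def rotate_point(x: float, y: float, width: float, height: float, rotation: int) -> Tuple[float, float]:
    """Map a point on an unrotated page to its position after a clockwise page rotation."""
    if rotation == 0:
        return x, y
    if rotation == 90:
        return height - y, x
    if rotation == 180:
        return width - x, height - y
    if rotation == 270:
        return y, width - x
    raise ValueError("rotation must be 0, 90, 180, or 270")


def rotate_bbox(bbox: List[float], width: float, height: float, rotation: int) -> List[float]:
    """Rotate both corners, then re-sort so the result is again [x0, y0, x1, y1]."""
    x0, y0, x1, y1 = bbox
    ax, ay = rotate_point(x0, y0, width, height, rotation)
    bx, by = rotate_point(x1, y1, width, height, rotation)
    return [min(ax, bx), min(ay, by), max(ax, bx), max(ay, by)]


# A 100x20 box on a 200x300 page.
print(rotate_bbox([10, 20, 110, 40], 200, 300, 90))
# -> [260, 10, 280, 110]
print(rotate_bbox([10, 20, 110, 40], 200, 300, 180))
# -> [90, 260, 190, 280]
```

After a 90 or 270 degree rotation the page itself becomes `height` wide and `width` tall, so the box's width and height swap as well.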

Package Rankings

Top 36.03% on PyPI.org

Related Projects

pdfrw

pdfrw is a pure Python library that reads and writes PDFs

30 May 2015 1,837

latexpages

Combine LaTeX docs into a single PDF

pdfminer.six

Community maintained fork of pdfminer - we fathom PDF

29 Aug 2014 5,843

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipula...

06 Oct 2012 4,980

confectionary

a tool to quickly create sweet PDF files from text files

textdistance

📐 Compute distance between sequences. 30+ algorithms, pure python implementation, common interfac...

05 May 2017 3,302

marker

Convert PDF to markdown quickly with high accuracy

30 Oct 2023 15,511

pdf.tocgen

A CLI toolset to generate table of contents for PDF files automatically.

28 Jul 2020 647

pypdftk

Python module to drive the awesome pdftk binary.

14 Mar 2013 145

pypdf

A pure-python PDF library capable of splitting, merging, cropping, and transforming the pages of ...

06 Jan 2012 7,337

pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) ...

08 Jul 2016 2,209

open-parse

Improved file parsing for LLM’s

22 Mar 2024 2,405

surya

OCR, layout analysis, reading order, line detection in 90+ languages

10 Jan 2024 6,739

doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

28 Aug 2016 1,267

pdfconduit

Prepare documents for distribution