pdftext

Extract structured text from pdfs quickly

APACHE-2.0 License

Downloads: 22.8K
Stars: 317

pdftext - Better device coordinate extraction

Published by VikParuchuri 10 days ago

There were some cases where visual and text coordinates didn't align. This fixes that issue.

pdftext - Revert extraction changes

Published by VikParuchuri 10 days ago

pdftext - Python 3.13 compatibility

Published by VikParuchuri 11 days ago

pdftext - Ignore special chars, break lines more aggressively

Published by VikParuchuri 11 days ago

pdftext - Fix flattening bug

Published by VikParuchuri 20 days ago

pdftext - Fix document loading bug

Published by VikParuchuri 20 days ago

  • Fixed a bug where PDF paths were assumed to be strings, which is not always the case

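A minimal sketch of the kind of normalization that avoids this class of bug, assuming the path may arrive as a str, bytes, or pathlib.Path. The helper name normalize_pdf_path is hypothetical and is not part of pdftext.

```python
import os
from pathlib import Path


def normalize_pdf_path(pdf_path):
    """Return the path as a plain str, accepting str, bytes, or os.PathLike.

    Hypothetical helper for illustration only.
    """
    # os.fspath accepts str, bytes, and any object implementing __fspath__
    # (such as pathlib.Path), and raises TypeError for anything else.
    path = os.fspath(pdf_path)
    if isinstance(path, bytes):
        # Decode bytes paths using the filesystem encoding.
        path = os.fsdecode(path)
    return path


print(normalize_pdf_path(Path("docs") / "sample.pdf"))
print(normalize_pdf_path("sample.pdf"))
```
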
pdftext - ONNX model, option to flatten form fields

Published by VikParuchuri 20 days ago

  • Faster inference with ONNX
  • Remove warning when loading scikit-learn model
  • Flatten form fields into pdf

pdftext - Fix bbox bug

Published by VikParuchuri 5 months ago

Fixed a bug where bboxes were not unnormalized properly.
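For readers unfamiliar with the term, here is a minimal sketch of bbox normalization and its inverse. It assumes "normalized" means coordinates divided by the page width and height so they fall in the 0 to 1 range. The release note does not state the convention pdftext uses, and both function names are hypothetical.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an [x0, y0, x1, y1] bbox in page units to the 0..1 range."""
    x0, y0, x1, y1 = bbox
    return [x0 / page_width, y0 / page_height, x1 / page_width, y1 / page_height]


def unnormalize_bbox(bbox, page_width, page_height):
    """Inverse of normalize_bbox: scale a 0..1 bbox back to page units."""
    x0, y0, x1, y1 = bbox
    # x coordinates scale with the width, y coordinates with the height.
    # Mixing the two up is an easy mistake on non-square pages.
    return [x0 * page_width, y0 * page_height, x1 * page_width, y1 * page_height]


print(unnormalize_bbox([0.25, 0.5, 0.75, 1.0], page_width=200, page_height=400))
```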

pdftext - Minor performance optimizations

Published by VikParuchuri 5 months ago

  • Optimize dictionary access and loops to get a ~10% speedup

pdftext - Add optional parallel workers

Published by VikParuchuri 5 months ago

Enable optional parallel workers when extracting text. This can cause a performance hit on small PDFs, but can speed things up 2x or more on larger ones. Enable it with the --workers flag on the CLI, or with the workers kwarg.
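The trade-off described above can be illustrated with a generic sketch of splitting pages across workers. This is not pdftext's implementation: split_pages, extract_chunk, and extract_parallel are hypothetical names, the per-page work is a placeholder, and a thread pool stands in for whatever worker mechanism the library actually uses.

```python
from concurrent.futures import ThreadPoolExecutor


def split_pages(page_count, workers):
    """Split page indices 0..page_count-1 into at most `workers` contiguous chunks."""
    if page_count <= 0:
        return []
    workers = max(1, min(workers, page_count))
    base, extra = divmod(page_count, workers)
    chunks, start = [], 0
    for i in range(workers):
        # The first `extra` chunks take one additional page each.
        size = base + (1 if i < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def extract_chunk(pages):
    """Placeholder for real per-page text extraction."""
    return [f"text of page {p}" for p in pages]


def extract_parallel(page_count, workers=None):
    """Extract all pages, optionally fanning chunks out to a pool."""
    # Pool setup has a fixed cost, so small documents stay on the serial path.
    if not workers or workers <= 1 or page_count <= 1:
        return extract_chunk(list(range(page_count)))
    chunks = split_pages(page_count, workers)
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # map() yields results in submission order, so page order is preserved.
        results = list(pool.map(extract_chunk, chunks))
    return [text for chunk in results for text in chunk]


print(split_pages(10, 3))
```

The fixed cost of starting a pool is why parallelism can hurt on small inputs and only pays off once each worker has enough pages to process.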

pdftext - Fix font issue

Published by VikParuchuri 6 months ago

Previously, not all spans had the right font information. This fixes the issue.

pdftext - Work around pdfium bug

Published by VikParuchuri 6 months ago

  • Charbox has zero width/height when loose=True with rotation

pdftext - Fix font names

Published by VikParuchuri 6 months ago

  • Fix logic for pulling font names
  • Increase sample frequency

pdftext - Change line breaks

Published by VikParuchuri 6 months ago

  • Use line breaks from pdfium

pdftext - Add option to keep individual characters

Published by VikParuchuri 6 months ago

  • Option to keep characters with JSON/dictionary output
  • Fix some bugs when interfacing with the pdfium C API (thanks @mara004)

pdftext - Reduce block false positives

Published by VikParuchuri 6 months ago

  • Add probability threshold for block predictions

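A minimal sketch of how a probability threshold can reduce false-positive block breaks. It assumes the model emits, for each line, a probability that the line starts a new block. The function name, the input shape, and the default threshold of 0.8 are all hypothetical and not taken from pdftext.

```python
def group_lines_into_blocks(lines, start_probs, threshold=0.8):
    """Group lines into blocks, starting a new block only on confident predictions."""
    blocks = []
    for line, prob in zip(lines, start_probs):
        # The first line always opens a block. After that, a new block starts
        # only when the predicted probability clears the threshold.
        if not blocks or prob >= threshold:
            blocks.append([line])
        else:
            blocks[-1].append(line)
    return blocks


lines = ["a", "b", "c", "d"]
probs = [0.1, 0.6, 0.95, 0.3]
print(group_lines_into_blocks(lines, probs, threshold=0.8))
print(group_lines_into_blocks(lines, probs, threshold=0.5))
```

With the higher threshold, the weak 0.6 prediction no longer splits the first block.
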
pdftext - Minor refactor; select page range

Published by VikParuchuri 6 months ago

  • Select a range of pages instead of converting the whole doc
  • Minor internal refactor to use docs instead of paths

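As an illustration of page-range selection, here is a sketch that turns a range specification into a list of page indices. The "0,3-5,20" string syntax, the zero-based indexing, and the name parse_page_range are assumptions made for this example. The release note does not say how pdftext expects the range to be written.

```python
def parse_page_range(spec, page_count):
    """Parse a spec like "0,3-5,20" into sorted, in-bounds, zero-based page indices."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # "3-5" is treated as an inclusive range.
            start, end = part.split("-", 1)
            pages.update(range(int(start), int(end) + 1))
        else:
            pages.add(int(part))
    # Drop indices that fall outside the document.
    return sorted(p for p in pages if 0 <= p < page_count)


print(parse_page_range("0,3-5,20", page_count=10))
```
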
pdftext - Improve output format, hyphen handling

Published by VikParuchuri 6 months ago

  • Fix bug where hyphens didn't show up at the end of lines
  • Improve wrapping for hyphens: join words across hyphens before a newline (disable by passing keep_hyphens)
  • Restructure output to avoid redundant info in the JSON blob: keep track of text spans with similar font info instead of individual characters
  • Update model to predict blocks more accurately

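The hyphen joining mentioned above can be sketched with a regular expression. This is an approximation for illustration, not pdftext's logic: join_hyphenated is a hypothetical name, and the keep_hyphens parameter only mirrors the option named in the notes, with its exact behavior assumed.

```python
import re

# A hyphen directly followed by a newline, with a word character on each side.
# Lookarounds keep the neighbouring characters out of the match, so
# back-to-back breaks are all handled.
_HYPHEN_BREAK = re.compile(r"(?<=\w)-\n(?=\w)")


def join_hyphenated(text, keep_hyphens=False):
    """Join words that were split across a line break with a hyphen."""
    if keep_hyphens:
        return text
    return _HYPHEN_BREAK.sub("", text)


print(join_hyphenated("extrac-\ntion of text"))
```

A rule this simple also removes the hyphen from genuine compounds that happen to break at the hyphen (for example "well-" / "known"), which is one reason an opt-out is useful.
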
pdftext - Improve character bboxes

Published by VikParuchuri 6 months ago

  • Switch the character box to a loose box, to get the full character range

pdftext - Handle rotations

Published by VikParuchuri 6 months ago

  • Rotate bboxes if the PDF is rotated
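A minimal sketch of rotating a bbox to match a rotated page. It assumes a top-left origin with y increasing downward, clockwise page rotation in multiples of 90 degrees, and bboxes given as [x0, y0, x1, y1] in the unrotated page's coordinates. These conventions and the function names are assumptions for illustration. pdftext and pdfium may use different ones.

```python
def rotate_point(x, y, width, height, rotation):
    """Map a point on an unrotated width x height page to the rotated page."""
    rotation %= 360
    if rotation == 0:
        return x, y
    if rotation == 90:
        return height - y, x
    if rotation == 180:
        return width - x, height - y
    if rotation == 270:
        return y, width - x
    raise ValueError("rotation must be a multiple of 90 degrees")


def rotate_bbox(bbox, width, height, rotation):
    """Rotate an [x0, y0, x1, y1] bbox and re-order its corners."""
    x0, y0, x1, y1 = bbox
    ax, ay = rotate_point(x0, y0, width, height, rotation)
    bx, by = rotate_point(x1, y1, width, height, rotation)
    # After rotation the original corners may swap roles, so take min/max
    # to keep the (left, top, right, bottom) ordering.
    return [min(ax, bx), min(ay, by), max(ax, bx), max(ay, by)]


print(rotate_bbox([10, 20, 30, 40], width=100, height=200, rotation=90))
```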