pdftext

Extract structured text from PDFs quickly

Apache-2.0 License

Downloads: 22.8K
Stars: 317

pdftext - Speed improvements

Published by VikParuchuri 6 months ago

  • Optimize some internal routines
  • Improve the model further

pdftext - Improve model

Published by VikParuchuri 6 months ago

  • Added a few extra line-related features
  • Improved accuracy of the model

pdftext - Initial release

Published by VikParuchuri 6 months ago

Initial version of pdftext. Fast text extraction based on pypdfium2.

  • Extract plain text, either sorted into reading order or in PDF order
  • Extract structured blocks and lines, with font and other information for each character
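
The two output modes listed above can be illustrated with a small, self-contained sketch. It does not use pdftext or pypdfium2 and is not the library's API or algorithm: the `Char`, `Line`, and `Block` classes, the `make_line` helper, and the `plain_text` function are hypothetical names invented for this illustration. The top-to-bottom, left-to-right sort is an assumed stand-in for whatever ordering pdftext actually applies.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical data model: NOT pdftext's real output format.
BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin at top-left


@dataclass
class Char:
    char: str
    font_name: str
    font_size: float


@dataclass
class Line:
    bbox: BBox
    chars: List[Char]

    @property
    def text(self) -> str:
        return "".join(c.char for c in self.chars)


@dataclass
class Block:
    bbox: BBox
    lines: List[Line]


def make_line(text: str, bbox: BBox, font_name: str = "Helvetica",
              font_size: float = 10.0) -> Line:
    """Build a line whose characters each carry their own font information."""
    return Line(bbox=bbox, chars=[Char(ch, font_name, font_size) for ch in text])


def plain_text(blocks: List[Block], sort: bool = False) -> str:
    """Join block text; optionally sort blocks top-to-bottom, then left-to-right.

    The sort is a simplifying assumption, not pdftext's actual ordering logic.
    """
    if sort:
        blocks = sorted(blocks, key=lambda b: (b.bbox[1], b.bbox[0]))
    return "\n\n".join("\n".join(line.text for line in b.lines) for b in blocks)


# Blocks listed in the order they might be stored in the file (footer first).
blocks = [
    Block((50, 700, 300, 715), [make_line("Page 1", (50, 700, 300, 715))]),
    Block((50, 40, 300, 60), [make_line("Title", (50, 40, 300, 60), font_size=18.0)]),
    Block((50, 100, 300, 130), [
        make_line("First line", (50, 100, 300, 115)),
        make_line("Second line", (50, 115, 300, 130)),
    ]),
]

pdf_order = plain_text(blocks)                # order as stored
reading_order = plain_text(blocks, sort=True)  # order as a reader would see it
title_char = blocks[1].lines[0].chars[0]       # per-character font information
```

In this sketch the footer comes first in `pdf_order` but last in `reading_order`, and `title_char` shows how font name and size could be attached to each character rather than to a whole line.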