Extract structured text from PDFs quickly
APACHE-2.0 License
Published by VikParuchuri 10 days ago
There were some cases where visual and text coordinates didn't align. This fixes that issue.
Published by VikParuchuri 11 days ago
Published by VikParuchuri 20 days ago
Published by VikParuchuri 5 months ago
Fixed a bug where bboxes were not unnormalized properly.
Published by VikParuchuri 5 months ago
Enable optional parallel workers when extracting text. This can cause a performance hit on small PDFs, but can speed things up 2x or more on larger ones. Set it with the --workers flag via the CLI, or with the workers kwarg.
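The tradeoff described above can be illustrated with a minimal, self-contained sketch. It is not the library's implementation: split_pages, extract_pages, and the extract_page callback are hypothetical names, and a thread pool stands in for whatever worker mechanism the library actually uses. The idea is that pages are split into contiguous chunks, one per worker, so the fixed cost of starting workers only pays off when each chunk holds enough pages.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def split_pages(page_count: int, workers: int) -> List[range]:
    """Split page indices 0..page_count-1 into at most `workers` contiguous chunks."""
    if page_count <= 0:
        return []
    # Never create more chunks than there are pages.
    workers = max(1, min(workers, page_count))
    size, extra = divmod(page_count, workers)
    chunks, start = [], 0
    for i in range(workers):
        # The first `extra` chunks take one additional page each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(range(start, end))
        start = end
    return chunks


def extract_pages(
    page_count: int,
    extract_page: Callable[[int], str],  # hypothetical per-page extractor
    workers: Optional[int] = None,
) -> List[str]:
    """Extract every page, optionally spreading chunks across workers."""
    if not workers or workers <= 1:
        # Serial path: no pool startup cost, usually best for small documents.
        return [extract_page(i) for i in range(page_count)]

    chunks = split_pages(page_count, workers)
    if not chunks:
        return []

    def run_chunk(chunk: range) -> List[str]:
        return [extract_page(i) for i in chunk]

    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        # pool.map yields results in input order, so page order is preserved.
        results = pool.map(run_chunk, chunks)
        return [text for chunk_result in results for text in chunk_result]
```

With only a handful of pages, the serial path avoids pool startup and scheduling overhead, which matches the note that small PDFs may get slower when workers are enabled.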
Published by VikParuchuri 6 months ago
Previously, not all spans had the right font information. This fixes the issue.
Published by VikParuchuri 6 months ago
keep_hyphens)
Published by VikParuchuri 6 months ago
loose box, to get the full character range
Published by VikParuchuri 6 months ago