pdftext

Extract structured text from PDFs quickly

Apache-2.0 License

Downloads

22.8K

Stars

317

Ecosystems: Python

Issue Statistics

Past Year

All Time

Total Pull Requests

Merged Pull Requests

Total Issues

Time to Close Issues

19 days

Package Rankings

Top 36.03% on pypi.org

Related Projects

marker

Convert PDF to markdown quickly with high accuracy

30 Oct 2023 15,511

pdfrw

pdfrw is a pure Python library that reads and writes PDFs

30 May 2015 1,837

pypdftk

Python module to drive the awesome pdftk binary.

14 Mar 2013 145

pdfminer.six

Community-maintained fork of pdfminer - we fathom PDF

29 Aug 2014 5,843

pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) ...

08 Jul 2016 2,209

pypdf

A pure-Python PDF library capable of splitting, merging, cropping, and transforming the pages of ...

06 Jan 2012 7,337

PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipula...

06 Oct 2012 4,980

pdfconduit

Prepare documents for distribution

20 Jul 2018 24

surya

OCR, layout analysis, reading order, line detection in 90+ languages

10 Jan 2024 6,739

open-parse

Improved file parsing for LLMs

22 Mar 2024 2,405

confectionary

A tool to quickly create sweet PDF files from text files

15 Feb 2022 3

pdf.tocgen

A CLI toolset to generate table of contents for PDF files automatically.

28 Jul 2020 647

latexpages

Combine LaTeX docs into a single PDF

17 Aug 2014 3

textdistance

📐 Compute distance between sequences. 30+ algorithms, pure Python implementation, common interfac...

05 May 2017 3,302

doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

28 Aug 2016 1,267