doc2text

Detect text blocks and OCR poorly scanned PDFs in bulk. Python module available via pip.

MIT License

Downloads

2.9K

Stars

1.3K

Committers

Ecosystems: Python

Commit Statistics

Past Year

All Time

Total Commits

Total Committers

Avg. Commits Per Committer

7.43

Bot Commits

Issue Statistics

Past Year

All Time

Total Pull Requests

Merged Pull Requests

Total Issues

Time to Close Issues

N/A

about 23 hours

Package Rankings

Top 4.98% on Pypi.org

Related Projects

parsevision

Parse vision is an open source tool to visualise what OCR is parsing in a PDF document to help de...

13 Jul 2024 51

normcap

OCR powered screen-capture tool to capture information instead of images

14 Aug 2019 1,905

Nkocr

🔎📝 This is a module for performing specific OCR on food products and nutritional tables.

07 Jul 2020 34

open-parse

Improved file parsing for LLMs

22 Mar 2024 2,405

surya

OCR, layout analysis, reading order, line detection in 90+ languages

10 Jan 2024 6,739

OCRmyPDF

OCRmyPDF adds an OCR text layer to scanned PDF files, allowing them to be searched

20 Dec 2013 12,250

pdftabextract

A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) ...

08 Jul 2016 2,209

image-to-text

Images of Text to Text: Call Tesseract from Python and OCR a directory of pdfs

29 Jan 2015 15

easytextract

Easy to use text extractor, from PDF, DOC, DOCX and other documents, including if necessary using...

12 Nov 2017 6

EasyOCR

Ready-to-use OCR with 80+ supported languages and all popular writing scripts including Latin, Ch...

14 Mar 2020 22,979

LaTeX-OCR

pix2tex: Using a ViT to convert images of equations into LaTeX code.

11 Dec 2020 12,124

pytesseract

A Python wrapper for Google Tesseract

27 Oct 2010 5,782

texthero

Text preprocessing, representation and visualization from zero to hero.

06 Apr 2020 2,881

pdf2images

Convert pdf to pages of images

19 Jul 2019 11

docquery

An easy way to extract information from documents

08 Aug 2022 1,692