img2table

img2table is a table identification and extraction Python Library for PDF and images, based on OpenCV image processing

MIT License

img2table is a simple, easy-to-use Python library for table identification and extraction. It is based on OpenCV image processing and supports most common image file formats as well as PDF files.

Thanks to its design, it provides a practical and lighter alternative to neural-network-based solutions, especially for usage on CPU.

Installation

The library can be installed via pip:

  • pip install img2table: Standard installation, supporting Tesseract
  • pip install img2table[paddle]: For usage with Paddle OCR
  • pip install img2table[easyocr]: For usage with EasyOCR
  • pip install img2table[surya]: For usage with Surya OCR
  • pip install img2table[gcp]: For usage with Google Vision OCR
  • pip install img2table[aws]: For usage with AWS Textract OCR
  • pip install img2table[azure]: For usage with Azure Cognitive Services OCR

Features

  • Table identification for images and PDF files, including bounding boxes at the table cell level
  • Handling of complex table structures such as merged cells
  • Handling of implicit content - see example
  • Table content extraction by providing support for OCR services / tools
  • Extracted tables are returned as a simple object, including a Pandas DataFrame representation
  • Export extracted tables to an Excel file, preserving their original structure

Supported file formats

Images

Images are loaded using the opencv-python library, so the supported formats are those that OpenCV can read.


PDF

Both native and scanned PDF files are supported.

Usage

Documents

Images

Images are instantiated as follows:

from img2table.document import Image

image = Image(src, 
              detect_rotation=False)

PDF

PDF files are instantiated as follows:

from img2table.document import PDF

pdf = PDF(src, 
          pages=[0, 2],
          detect_rotation=False,
          pdf_text_extraction=True)

PDF pages are converted to images at 200 DPI for table identification.
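To give a sense of what 200 DPI means for image size, the sketch below converts page dimensions from PDF points (1/72 inch) to pixels. The helper `page_size_in_pixels` is hypothetical and not part of img2table, and the library's own rounding may differ.

```python
PDF_POINTS_PER_INCH = 72  # standard PDF unit: 1 point = 1/72 inch
RENDER_DPI = 200          # resolution stated above for table identification


def page_size_in_pixels(width_pt, height_pt, dpi=RENDER_DPI):
    """Approximate pixel size of a page rendered at `dpi` (hypothetical helper)."""
    scale = dpi / PDF_POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


# An A4 page is roughly 595 x 842 points
print(page_size_in_pixels(595, 842))  # (1653, 2339)
```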


OCR

img2table provides an interface for several OCR services and tools in order to parse table content. If possible (i.e. for native PDFs), PDF text will be extracted directly from the file and the OCR service/tool will not be called.

from img2table.ocr import TesseractOCR

ocr = TesseractOCR(n_threads=1, 
                   lang="eng", 
                   psm=11,
                   tessdata_dir="...")

Tesseract-OCR must be installed before it can be used. Check its documentation for instructions. Windows users getting environment variable errors can check this tutorial.

PaddleOCR is an open-source OCR tool based on deep learning models. At first use, the relevant language models will be downloaded.

from img2table.ocr import PaddleOCR

ocr = PaddleOCR(lang="en",
                kw={"kwarg": kw_value, ...})

# Example of installation with CUDA 11.8
pip install paddlepaddle-gpu==2.5.0rc1.post118 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
pip install paddleocr img2table

If you get an error trying to run PaddleOCR on Ubuntu, please check this issue for a working solution.

EasyOCR is an open-source OCR tool based on deep learning models. At first use, the relevant language models will be downloaded.

from img2table.ocr import EasyOCR

ocr = EasyOCR(lang=["en"],
              kw={"kwarg": kw_value, ...})

docTR is an open-source OCR tool based on deep learning models. To be used, docTR has to be installed by the user beforehand. Installation procedures are detailed in the package documentation.

from img2table.ocr import DocTR

ocr = DocTR(detect_language=False,
            kw={"kwarg": kw_value, ...})

Only available for Python >= 3.10.

Surya is an open-source OCR tool based on deep learning models. At first use, the relevant models will be downloaded.

from img2table.ocr import SuryaOCR

ocr = SuryaOCR(langs=["en"])

Authentication to GCP can be done by setting the standard GOOGLE_APPLICATION_CREDENTIALS environment variable. If this variable is missing, an API key should be provided via the api_key parameter.

from img2table.ocr import VisionOCR

ocr = VisionOCR(api_key="api_key", timeout=15)
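The authentication rule above can be sketched as a small helper that only passes api_key when GOOGLE_APPLICATION_CREDENTIALS is not set. The function names and the GCP_VISION_API_KEY variable are hypothetical, chosen for this illustration. The sketch also assumes that api_key can be omitted when the credentials variable is present, as the text suggests.

```python
import os


def vision_ocr_kwargs(env=None, timeout=15):
    """Build keyword arguments for VisionOCR from the environment (sketch)."""
    env = os.environ if env is None else env
    kwargs = {"timeout": timeout}
    if "GOOGLE_APPLICATION_CREDENTIALS" not in env:
        # GCP_VISION_API_KEY is a hypothetical variable name used for this example
        kwargs["api_key"] = env["GCP_VISION_API_KEY"]
    return kwargs


def build_vision_ocr():
    """Instantiate VisionOCR with whichever authentication is available."""
    # Imported lazily so the helper above can be used without img2table installed
    from img2table.ocr import VisionOCR
    return VisionOCR(**vision_ocr_kwargs())
```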

When using AWS Textract, only the DetectDocumentText API is called.

Authentication to AWS can be done by passing credentials to the TextractOCR class. If credentials are not provided, authentication is done using environment variables or configuration files. Check boto3 documentation for more details.

from img2table.ocr import TextractOCR

ocr = TextractOCR(aws_access_key_id="***",
                  aws_secret_access_key="***",
                  aws_session_token="***",
                  region="eu-west-1")

from img2table.ocr import AzureOCR

ocr = AzureOCR(endpoint="abc.azure.com",
               subscription_key="***")

Table extraction

Multiple tables can be extracted at once from a PDF page or an image using the extract_tables method of a document.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Table extraction
extracted_tables = doc.extract_tables(ocr=ocr,
                                      implicit_rows=False,
                                      implicit_columns=False,
                                      borderless_tables=False,
                                      min_confidence=50)

NB: Borderless table extraction can, by design, only extract tables with 3 or more columns.
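Since the same extract_tables call applies to both document types, a thin wrapper can pick the class from the file extension. This is only a sketch: `document_kind` and `extract_from_file` are hypothetical helpers, not part of img2table, and the extension check is a simplification.

```python
from pathlib import Path


def document_kind(path):
    """Return "pdf" for PDF files and "image" otherwise, based on the extension."""
    return "pdf" if Path(path).suffix.lower() == ".pdf" else "image"


def extract_from_file(path, ocr):
    """Load `path` as a PDF or an Image and run table extraction on it."""
    # Imported lazily so document_kind can be used without img2table installed
    from img2table.document import PDF, Image

    doc = PDF(str(path)) if document_kind(path) == "pdf" else Image(str(path))
    # Same arguments as in the example above
    return doc.extract_tables(ocr=ocr,
                              implicit_rows=False,
                              implicit_columns=False,
                              borderless_tables=False,
                              min_confidence=50)
```

Note that the return type depends on the document class, as described in the next section.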

Method return

The ExtractedTable class is used to model extracted tables from documents.

To access bounding boxes at the cell level, you can use the following code snippet:

for id_row, row in enumerate(table.content.values()):
    for id_col, cell in enumerate(row):
        x1 = cell.bbox.x1
        y1 = cell.bbox.y1
        x2 = cell.bbox.x2
        y2 = cell.bbox.y2
        value = cell.value
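Building on that loop, the sketch below flattens a table into one record per cell, with the cell size derived from its bounding box. `cell_records` is a hypothetical helper, and the objects built with SimpleNamespace are stand-ins that mimic only the attributes used here, not real ExtractedTable instances.

```python
from types import SimpleNamespace


def cell_records(table):
    """Flatten table.content into one dict per cell, with its bounding box size."""
    records = []
    for id_row, row in enumerate(table.content.values()):
        for id_col, cell in enumerate(row):
            records.append({
                "row": id_row,
                "col": id_col,
                "value": cell.value,
                # Width and height are in the same units as the bounding box
                "width": cell.bbox.x2 - cell.bbox.x1,
                "height": cell.bbox.y2 - cell.bbox.y1,
            })
    return records


# Stand-in objects for illustration only
fake_cell = SimpleNamespace(bbox=SimpleNamespace(x1=10, y1=20, x2=110, y2=50), value="Total")
fake_table = SimpleNamespace(content={0: [fake_cell]})
print(cell_records(fake_table))
```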

The extract_tables method of the Image class returns a list of ExtractedTable objects.

output = [ExtractedTable(...), ExtractedTable(...), ...]

The extract_tables method of the PDF class returns an OrderedDict with page indexes as keys and lists of ExtractedTable objects as values.

output = {
    0: [ExtractedTable(...), ...],
    1: [],
    ...
    last_page: [ExtractedTable(...), ...]
}
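A page-keyed result like this is usually consumed by iterating over pages and then over the tables of each page. The helpers below are a hypothetical sketch, and plain strings stand in for ExtractedTable objects so that the example runs without any document.

```python
from collections import OrderedDict


def tables_per_page(pdf_output):
    """Map each page index to the number of tables found on that page."""
    return {page: len(tables) for page, tables in pdf_output.items()}


def iter_tables(pdf_output):
    """Yield (page_index, table_index, table) for every extracted table."""
    for page, tables in pdf_output.items():
        for idx, table in enumerate(tables):
            yield page, idx, table


# Stand-in output: strings take the place of ExtractedTable objects
fake_output = OrderedDict([(0, ["table_a", "table_b"]), (1, []), (2, ["table_c"])])
print(tables_per_page(fake_output))                      # {0: 2, 1: 0, 2: 1}
print([(p, i) for p, i, _ in iter_tables(fake_output)])  # [(0, 0), (0, 1), (2, 0)]
```

Pages without tables keep an empty list, so they appear in the counts but produce nothing when iterating.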

Excel export

Tables extracted from a document can be exported to an xlsx file. The resulting file contains one worksheet per extracted table. The method's arguments are mostly the same as those of the extract_tables method.

from img2table.ocr import TesseractOCR
from img2table.document import Image

# Instantiation of OCR
ocr = TesseractOCR(n_threads=1, lang="eng")

# Instantiation of document, either an image or a PDF
doc = Image(src)

# Extraction of tables and creation of an xlsx file containing them
doc.to_xlsx(dest=dest,
            ocr=ocr,
            implicit_rows=False,
            implicit_columns=False,
            borderless_tables=False,
            min_confidence=50)
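One common pattern is to write the workbook next to the source file. The sketch below derives the destination path with pathlib and then calls to_xlsx with the same arguments as above. `xlsx_destination` and `export_tables` are hypothetical helpers, and passing the destination as a string path is an assumption of this example.

```python
from pathlib import Path


def xlsx_destination(src):
    """Return the source path with its extension replaced by .xlsx."""
    return Path(src).with_suffix(".xlsx")


def export_tables(src, ocr):
    """Extract tables from an image file and write them next to it as .xlsx."""
    # Imported lazily so xlsx_destination can be used without img2table installed
    from img2table.document import Image

    dest = xlsx_destination(src)
    Image(str(src)).to_xlsx(dest=str(dest),
                            ocr=ocr,
                            implicit_rows=False,
                            implicit_columns=False,
                            borderless_tables=False,
                            min_confidence=50)
    return dest
```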

Examples

Several Jupyter notebooks with examples are available.

Caveats / FYI
