
= PDF-table
:toc:

== What is PDF-table?

PDF-table is a Java utility library for parsing tabular data in PDF documents. +
Core processing of PDF documents is performed with Apache PDFBox and OpenCV.

== Prerequisites

=== JDK

Java 8 is required.

=== External dependencies

pdf-table requires a compiled build of OpenCV 3.4.2 to work properly:

. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
. Unpack it and add the following directory to your system PATH:
* Windows: <opencv dir>\build\java\x64
* Linux: TODO

== Installation
[source, xml]

== Usage

=== Parsing PDFs

When a PDF document page is parsed, the following operations are performed:

. The page is converted to a grayscale image [OpenCV].
. A Binary Inverted Threshold (BIT) is applied to the grayscale image [OpenCV].
. Contours are detected on the BIT image and a contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV].
. The contour mask is XORed with the BIT image [OpenCV].
. Contours are detected once again on the XORed image (additional Canny filtering can be turned on in this step) [OpenCV].
. Final contours are drawn [OpenCV].
. Bounding rectangles are detected from the final contours [OpenCV].
. The PDF is parsed region by region using the coordinates of the bounding rectangles [Apache PDFBox].

The above algorithm is mostly derived from http://stackoverflow.com/a/23106594.
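To make the thresholding and XOR steps more concrete, here is a minimal pure-Java sketch that works on a tiny grayscale grid. It is only an illustration: the library delegates these steps to OpenCV, and the class name `ThresholdXorSketch`, the threshold of 128, and the hand-written contour mask are assumptions made for this example.

```java
import java.util.Arrays;

/** Illustrative only: mimics a binary inverted threshold and a pixel-wise XOR on small int grids. */
final class ThresholdXorSketch {

    /** Pixels brighter than the threshold become 0, all others become 255. */
    static int[][] binaryInvertedThreshold(int[][] gray, int threshold) {
        int[][] out = new int[gray.length][];
        for (int y = 0; y < gray.length; y++) {
            out[y] = new int[gray[y].length];
            for (int x = 0; x < gray[y].length; x++) {
                out[y][x] = gray[y][x] > threshold ? 0 : 255;
            }
        }
        return out;
    }

    /** Pixel-wise XOR of two binary images with identical dimensions. */
    static int[][] xor(int[][] a, int[][] b) {
        int[][] out = new int[a.length][];
        for (int y = 0; y < a.length; y++) {
            out[y] = new int[a[y].length];
            for (int x = 0; x < a[y].length; x++) {
                out[y][x] = a[y][x] ^ b[y][x];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // dark cross (value 10) on a bright background (value 250)
        int[][] gray = {
            {250, 10, 250},
            {10, 10, 10},
            {250, 10, 250}
        };
        int[][] bit = binaryInvertedThreshold(gray, 128);

        // hand-written stand-in for a contour mask: the outline of the cross without its centre
        int[][] contourMask = {
            {0, 255, 0},
            {255, 0, 255},
            {0, 255, 0}
        };

        // XOR removes what both images share and keeps what differs
        System.out.println(Arrays.deepToString(xor(contourMask, bit)));
    }
}
```

In this toy input the XOR cancels the outline pixels and leaves only the centre pixel set, which is the kind of region a second contour pass can then pick up.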

For more information about the parsed output, refer to <<Output format>>.

==== single-threaded example
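[source, java]
----
import java.io.File;
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage come from the pdf-table library

class SingleThreadParser {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();
        // page numbers start at 1
        List<ParsedTablePage> parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
    }
}
----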

==== multi-threaded example
[source, java]
----
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader and ParsedTablePage come from the pdf-table library

class MultiThreadParser {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        // parse pages simultaneously
        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<ParsedTablePage>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<ParsedTablePage> callable = () -> reader.parsePdfTablePage(pdfDoc, pageNum);
            futures.add(executor.submit(callable));
        }

        // collect parsed pages
        List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
        try {
            for (Future<ParsedTablePage> f : futures) {
                ParsedTablePage page = f.get();
                unsortedParsedPages.add(page.getPageNum() - 1, page);
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            // release the worker threads so the JVM can exit
            executor.shutdown();
        }

        // sort pages by pageNum
        List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
                .sorted(Comparator.comparingInt(ParsedTablePage::getPageNum))
                .collect(Collectors.toList());
    }
}
----
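The submit-then-collect pattern used above can be factored into a small reusable helper. The sketch below is an assumption-laden illustration rather than part of pdf-table: `OrderedPageTasks`, `PageTask`, and `runForPages` are hypothetical names. Because the futures are read back in submission order, the results come out in page order without a separate sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical helper: runs one task per page on a fixed thread pool and keeps page order. */
final class OrderedPageTasks {

    /** A per-page task that may throw, e.g. a call into a PDF parser. */
    @FunctionalInterface
    interface PageTask<T> {
        T run(int pageNum) throws Exception;
    }

    /** Runs the task for pages 1..pageCount and returns the results ordered by page number. */
    static <T> List<T> runForPages(int pageCount, int threadCount, PageTask<T> pageTask) {
        ExecutorService executor = Executors.newFixedThreadPool(threadCount);
        try {
            List<Future<T>> futures = new ArrayList<>(pageCount);
            for (int i = 1; i <= pageCount; i++) {
                final int pageNum = i;
                Callable<T> callable = () -> pageTask.run(pageNum);
                futures.add(executor.submit(callable));
            }
            // futures are read in submission order, so results stay in page order
            List<T> results = new ArrayList<>(pageCount);
            for (Future<T> f : futures) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for page tasks", e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            // always release the worker threads
            executor.shutdown();
        }
    }
}
```

With the reader from the example above, a call could look like `OrderedPageTasks.runForPages(pdfDoc.getNumberOfPages(), 8, n -> reader.parsePdfTablePage(pdfDoc, n))`.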

=== Saving PDF pages as PNG images

PDF-table provides methods for saving PDF pages as PNG images. +
The rendering DPI can be modified in `PdfTableSettings` (see: <<Parsing settings>>).

==== single-threaded example
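[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader comes from the pdf-table library

class SingleThreadPNGDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}
----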

==== multi-threaded example
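[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;

class MultiThreadPNGDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            // wait until every page has been written
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }
}
----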

=== Saving debug PNG images

When tables in a PDF document cannot be parsed correctly with the default settings, the user can save debug images that show the page at various stages of processing. +
Using these images, the user can adjust `PdfTableSettings` to achieve the desired results (see: <<Parsing settings>>).

==== single-threaded example
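[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.pdfbox.pdmodel.PDDocument;
// PdfTableReader comes from the pdf-table library

class SingleThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        Path outputPath = Paths.get("C:", "some_directory");
        PdfTableReader reader = new PdfTableReader();
        reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
    }
}
----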

==== multi-threaded example
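[source, java]
----
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.IntStream;
import org.apache.pdfbox.pdmodel.PDDocument;

class MultiThreadDebugImgsDump {
    public static void main(String[] args) throws IOException {
        final int THREAD_COUNT = 8;
        Path outputPath = Paths.get("C:", "some_directory");
        PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
        PdfTableReader reader = new PdfTableReader();

        ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
        List<Future<Boolean>> futures = new ArrayList<>();
        for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
            Callable<Boolean> callable = () -> {
                reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
                return true;
            };
            futures.add(executor.submit(callable));
        }

        try {
            // wait until every debug image has been written
            for (Future<Boolean> f : futures) {
                f.get();
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            executor.shutdown();
        }
    }
}
----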

=== Parsing settings

PDF rendering and OpenCV filtering settings are stored in a `PdfTableSettings` object.

A custom settings instance can be passed to the `PdfTableReader` constructor when non-default values are needed:

[source, java]
----
(...)

// build settings object
PdfTableSettings settings = PdfTableSettings.getBuilder()
        .setCannyFiltering(true)
        .setCannyApertureSize(5)
        .setCannyThreshold1(40)
        .setCannyThreshold2(190.5)
        .setPdfRenderingDpi(160)
        .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);
----
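When picking a rendering DPI it helps to know how large the rendered page image will be. The sketch below shows the general arithmetic, assuming the default PDF user space of 72 points per inch. `DpiMath` and its methods are hypothetical names, and this is not a description of how pdf-table converts coordinates internally.

```java
/** Hypothetical helper: converts between PDF points and rendered pixels for a given DPI. */
final class DpiMath {

    /** Default PDF user space: 72 points per inch. */
    static final double POINTS_PER_INCH = 72.0;

    /** Size in pixels of a length given in PDF points, rendered at the given DPI. */
    static int pointsToPixels(double points, int dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Length in PDF points that corresponds to a pixel distance in an image rendered at the given DPI. */
    static double pixelsToPoints(double pixels, int dpi) {
        return pixels * POINTS_PER_INCH / dpi;
    }

    public static void main(String[] args) {
        int dpi = 160; // same value as in the settings example
        // US Letter page: 612 x 792 points
        System.out.println(pointsToPixels(612, dpi) + " x " + pointsToPixels(792, dpi));
    }
}
```

A higher DPI gives the contour detection more pixels to work with, at the cost of larger images and longer processing.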

=== Output format
Each parsed PDF page is returned as a `ParsedTablePage` object:
[source, java]
----
(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0!
ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number
assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List
// getting first row
ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row
String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contains surrounding whitespace characters,
// so it is recommended to trim it before processing
double thirdCellNumericValue = Double.parseDouble(thirdCellContent.trim());
----
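Real tables often mix numbers with blanks or text, so a direct conversion can throw `NumberFormatException`. The following sketch shows one defensive way to convert a cell string. `CellValues` and `parseNumericCell` are hypothetical names introduced here, not part of the pdf-table API.

```java
import java.util.OptionalDouble;

/** Hypothetical helper for turning raw cell text into numbers without throwing. */
final class CellValues {

    /** Returns the numeric value of a cell, or empty if the cell is null, blank or not a number. */
    static OptionalDouble parseNumericCell(String cellContent) {
        if (cellContent == null) {
            return OptionalDouble.empty();
        }
        String trimmed = cellContent.trim();
        if (trimmed.isEmpty()) {
            return OptionalDouble.empty();
        }
        try {
            return OptionalDouble.of(Double.parseDouble(trimmed));
        } catch (NumberFormatException e) {
            // cell holds text such as a header or a placeholder
            return OptionalDouble.empty();
        }
    }
}
```

With the page from the example above, `CellValues.parseNumericCell(firstPage.getRow(1).getCell(2))` would give an `OptionalDouble` that can be checked with `isPresent()` before use.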
