pdf-table

Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV

MIT License

Stars
69

= PDF-table :toc:

== What is PDF-table? PDF-table is Java utility library that can be used for parsing tabular data in PDF documents. + Core processing of PDF documents is performed with utilization of Apache PDFBox and OpenCV.

== Prerequisites

=== JDK

JAVA 8 is required.

=== External dependencies

pdf-table requires compiled OpenCV 3.4.2 to work properly:

. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2 . Unpack it and add to your system PATH: * Windows: <opencv dir>\build\java\x64 * Linux: TODO

== Installation
[source, xml]

== Usage

=== Parsing PDFs When PDF document page is being parsed, following operations are performed:

. Page is converted to grayscale image [OpenCV]. . Binary Inverted Threshold (BIT) is applied to grayscaled image [OpenCV]. . Contours are detected on BIT image and contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV]. . Contour mask is XORed with BIT image [OpenCV]. . Contours are detected once again on XORed image (additional Canny filtering can be turned on in this step) [OpenCV]. . Final contours are drawn [OpenCV]. . Bounding rectangles are detected from final contours [OpenCV]. . PDF is being parsed region-by-region using bounding rectangles coordinates [Apache PDFBox].

Above algorithm is mostly derived from http://stackoverflow.com/a/23106594.

For more information about parsed output, refer to <>

==== single-threaded example
[source, java]

class SingleThreadParser {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
PdfTableReader reader = new PdfTableReader();
List parsed = reader.parsePdfTablePages(pdfDoc, 1, pdfDoc.getNumberOfPages());
}
}

==== multi-threaded example
[source, java]

class MultiThreadParser { public static void main(String[] args) throws IOException { final int THREAD_COUNT = 8; PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();

    // parse pages simultaneously
    ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
    List<Future<ParsedTablePage>> futures = new ArrayList<>();
    for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
        Callable<ParsedTablePage> callable = () -> {
            ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
            return page;
        };
        futures.add(executor.submit(callable));
    }

    // collect parsed pages
    List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
    try {
        for (Future<ParsedTablePage> f : futures) {
            ParsedTablePage page = f.get();
            unsortedParsedPages.add(page.getPageNum() - 1, page);
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }

    // sort pages by pageNum
    List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
            .sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
}

}

=== Saving PDF pages as PNG images PDF-Table provides methods for saving PDF pages as PNG images. + Rendering DPI can be modified in PdfTableSettings (see: <>).

==== single-threaded example
[source, java]

class SingleThreadPNGDump {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
Path outputPath = Paths.get("C:", "some_directory");
PdfTableReader reader = new PdfTableReader();
reader.savePdfPagesAsPNG(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
}
}

==== multi-threaded example
[source, java]

class MultiThreadPNGDump { public static void main(String[] args) throws IOException { final int THREAD_COUNT = 8; Path outputPath = Paths.get("C:", "some_directory"); PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();

    ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
    List<Future<Boolean>> futures = new ArrayList<>();
    for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
        Callable<Boolean> callable = () -> {
            reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
            return true;
        };
        futures.add(executor.submit(callable));
    }

    try {
        for (Future<Boolean> f : futures) {
            f.get();
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

}

=== Saving debug PNG images When tables in PDF document cannot be parsed correctly with default settings, user can save debug images that show page at various stages of processing. + Using these images, user can adjust PdfTableSettings accordingly to achieve desired results (see: <>).

==== single-threaded example
[source, java]

class SingleThreadDebugImgsDump {
public static void main(String[] args) throws IOException {
PDDocument pdfDoc = PDDocument.load(new File("some.pdf"));
Path outputPath = Paths.get("C:", "some_directory");
PdfTableReader reader = new PdfTableReader();
reader.savePdfTablePagesDebugImages(pdfDoc, 1, pdfDoc.getNumberOfPages(), outputPath);
}
}

==== multi-threaded example
[source, java]

class MultiThreadDebugImgsDump { public static void main(String[] args) throws IOException { final int THREAD_COUNT = 8; Path outputPath = Paths.get("C:", "some_directory"); PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();

    ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
    List<Future<Boolean>> futures = new ArrayList<>();
    for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
        Callable<Boolean> callable = () -> {
            reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
            return true;
        };
        futures.add(executor.submit(callable));
    }

    try {
        for (Future<Boolean> f : futures) {
            f.get();
        }
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

}

=== Parsing settings

PDF rendering and OpenCV filtering settings are stored in PdfTableSettings object.

Custom settings instance can be passed to PdfTableReader constructor when non-default values are needed:

[source, java]

(...)

// build settings object PdfTableSettings settings = PdfTableSettings.getBuilder() .setCannyFiltering(true) .setCannyApertureSize(5) .setCannyThreshold1(40) .setCannyThreshold2(190.5) .setPdfRenderingDpi(160) .build();

// pass settings to reader
PdfTableReader reader = new PdfTableReader(settings);

=== Output format
Each parsed PDF page is being returned as ParsedTablePage object:
[source, java]

(...)

PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();

// first page in document has index == 1, not 0 ! ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);

// getting page number assert firstPage.getPageNum() == 1;

// rows and cells are zero-indexed just like elements of the List // getting first row ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);

// getting third cell in second row String thirdCellContent = firstPage.getRow(1).getCell(2);

// cell content usually contain characters,
// so it is recommended to trim them before processing
double thirdCellNumericValue = Double.valueOf(thirdCellContent.trim());