Java utility for parsing PDF tabular data using Apache PDFBox and OpenCV
MIT License
= PDF-table :toc:
== What is PDF-table? PDF-table is Java utility library that can be used for parsing tabular data in PDF documents. + Core processing of PDF documents is performed with utilization of Apache PDFBox and OpenCV.
== Prerequisites
=== JDK
JAVA 8 is required.
=== External dependencies
pdf-table requires compiled OpenCV 3.4.2 to work properly:
. Download OpenCV v3.4.2 from https://github.com/opencv/opencv/releases/tag/3.4.2
. Unpack it and add to your system PATH:
* Windows: <opencv dir>\build\java\x64
* Linux: TODO
== Usage
=== Parsing PDFs When PDF document page is being parsed, following operations are performed:
. Page is converted to grayscale image [OpenCV]. . Binary Inverted Threshold (BIT) is applied to grayscaled image [OpenCV]. . Contours are detected on BIT image and contour mask is created (additional Canny filtering can be turned on in this step) [OpenCV]. . Contour mask is XORed with BIT image [OpenCV]. . Contours are detected once again on XORed image (additional Canny filtering can be turned on in this step) [OpenCV]. . Final contours are drawn [OpenCV]. . Bounding rectangles are detected from final contours [OpenCV]. . PDF is being parsed region-by-region using bounding rectangles coordinates [Apache PDFBox].
Above algorithm is mostly derived from http://stackoverflow.com/a/23106594.
For more information about parsed output, refer to <>
class MultiThreadParser { public static void main(String[] args) throws IOException { final int THREAD_COUNT = 8; PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();
// parse pages simultaneously
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List<Future<ParsedTablePage>> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable<ParsedTablePage> callable = () -> {
ParsedTablePage page = reader.parsePdfTablePage(pdfDoc, pageNum);
return page;
};
futures.add(executor.submit(callable));
}
// collect parsed pages
List<ParsedTablePage> unsortedParsedPages = new ArrayList<>(pdfDoc.getNumberOfPages());
try {
for (Future<ParsedTablePage> f : futures) {
ParsedTablePage page = f.get();
unsortedParsedPages.add(page.getPageNum() - 1, page);
}
} catch (Exception e) {
throw new RuntimeException(e);
}
// sort pages by pageNum
List<ParsedTablePage> sortedParsedPages = unsortedParsedPages.stream()
.sorted((p1, p2) -> Integer.compare(p1.getPageNum(), p2.getPageNum())).collect(Collectors.toList());
}
=== Saving PDF pages as PNG images
PDF-Table provides methods for saving PDF pages as PNG images. +
Rendering DPI can be modified in PdfTableSettings
(see: <>).
class MultiThreadPNGDump { public static void main(String[] args) throws IOException { final int THREAD_COUNT = 8; Path outputPath = Paths.get("C:", "some_directory"); PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List<Future<Boolean>> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable<Boolean> callable = () -> {
reader.savePdfPageAsPNG(pdfDoc, pageNum, outputPath);
return true;
};
futures.add(executor.submit(callable));
}
try {
for (Future<Boolean> f : futures) {
f.get();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
=== Saving debug PNG images
When tables in PDF document cannot be parsed correctly with default settings, user can save debug images that show page
at various stages of processing. +
Using these images, user can adjust PdfTableSettings
accordingly to achieve desired results
(see: <>).
class MultiThreadDebugImgsDump { public static void main(String[] args) throws IOException { final int THREAD_COUNT = 8; Path outputPath = Paths.get("C:", "some_directory"); PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();
ExecutorService executor = Executors.newFixedThreadPool(THREAD_COUNT);
List<Future<Boolean>> futures = new ArrayList<>();
for (final int pageNum : IntStream.rangeClosed(1, pdfDoc.getNumberOfPages()).toArray()) {
Callable<Boolean> callable = () -> {
reader.savePdfTablePagesDebugImage(pdfDoc, pageNum, outputPath);
return true;
};
futures.add(executor.submit(callable));
}
try {
for (Future<Boolean> f : futures) {
f.get();
}
} catch (Exception e) {
throw new RuntimeException(e);
}
}
=== Parsing settings
PDF rendering and OpenCV filtering settings are stored in PdfTableSettings
object.
Custom settings instance can be passed to PdfTableReader
constructor when non-default values are needed:
(...)
// build settings object PdfTableSettings settings = PdfTableSettings.getBuilder() .setCannyFiltering(true) .setCannyApertureSize(5) .setCannyThreshold1(40) .setCannyThreshold2(190.5) .setPdfRenderingDpi(160) .build();
ParsedTablePage
object:(...)
PDDocument pdfDoc = PDDocument.load(new File("some.pdf")); PdfTableReader reader = new PdfTableReader();
// first page in document has index == 1, not 0 ! ParsedTablePage firstPage = reader.parsePdfTablePage(pdfDoc, 1);
// getting page number assert firstPage.getPageNum() == 1;
// rows and cells are zero-indexed just like elements of the List // getting first row ParsedTablePage.ParsedTableRow firstRow = firstPage.getRow(0);
// getting third cell in second row String thirdCellContent = firstPage.getRow(1).getCell(2);