a tool to quickly create sweet PDF files from text files
APACHE-2.0 License
Work in the NLP domain and find that your end users/clients don't like using .txt files with your excellent results? Look no further!
PDF Confectionary is a tool for quickly creating templated PDFs from text files using FPDF2. Essentially, point it at a directory of text files, and generate some sweet PDFs.
The focus of this repo is to provide a simple, easy-to-use, and extensible PDF creation tool. Relevant features in PDF Confectionary include:
- textsplit module

This module was inspired by the need to create clean output documents for reading and reviewing speech transcriptions from the vid2cleantxt project. PDF Confectionary was initially designed as a command-line tool but provides a Python API for more advanced use cases.
Primary modules used by confectionary are: FPDF2, textsplit, gensim, and clean-text.
All dependencies are listed in the requirements.txt file.
The package can be installed using pip:
pip install confectionary
To install from source as an editable Python package, run:
git clone https://github.com/pszemraj/confectionary.git
cd confectionary
pip install -e .
There are two ways to use PDF Confectionary:
- Command line, via python confectionary/text2pdf.py -i <input_dir> -o <output_dir>
- Python API, via functions in the confectionary.text2pdf module. The dir_to_pdf function is the equivalent of the command-line tool.
Both create one PDF from all .txt files in the input directory and save it to output_dir. Add the -r switch (or recurse=True in the function call) to load files recursively.
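To make the effect of the -r switch (or recurse=True) concrete, the sketch below collects .txt files from a directory with and without recursion. It uses only the standard library and is an illustration, not the package's own file discovery code; the helper name find_txt_files is hypothetical.

```python
from pathlib import Path


def find_txt_files(input_dir, recurse=False):
    """Return a sorted list of .txt files found in input_dir.

    Hypothetical helper for illustration only; it is not part of
    confectionary. With recurse=True, subdirectories are searched too.
    """
    root = Path(input_dir)
    # "**/*.txt" matches files at every depth (including the top level),
    # while "*.txt" matches only files directly inside input_dir.
    pattern = "**/*.txt" if recurse else "*.txt"
    return sorted(p for p in root.glob(pattern) if p.is_file())
```

Calling find_txt_files("example/text-files") would list only the top-level text files, while passing recurse=True would also pick up files in nested folders.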
Call python confectionary/text2pdf.py with the -i and -o switches to create a PDF from all .txt files in the input directory and save it to the output directory. The -kw switch supplies keywords that label the document:
python confectionary/text2pdf.py -i /path/to/input/dir -o /path/to/output/dir \
-kw "my keywords to label this document."
The example below shows the output of the command-line tool and uses the -m switch to select a specific word2vec model.
$ python confectionary/text2pdf.py -i "example/text-files" -o "example/outputs" -kw "my keywords to label this document" \
-m "glove-wiki-gigaword-200"
Output:
Since the GPL-licensed package `unidecode` is not installed, using Pythons `unicodedata` package which yields worse results.
3 files found matching extension .txt
# entries is 3, < title thresh 39
will use one page for TOC
Building Chapters in PDF file: 0%| | 0/3 [00:00<?, ?it/s]
No local model file - downloading glove-wiki-gigaword-200 from gensim-data API
[==================================================] 100.0% 252.1/252.1MB downloaded
Loaded word2vec model glove-wiki-gigaword-200
Building Chapters in PDF file: 100%|████████████████████████████████████████████████████████████████████████████| 3/3 [01:23<00:00, 27.77s/it]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3484.61it/s]
PDF file written to example/outputs/text-to-PDF/my keywords to label this document_txt2pdf_Oct-18-2022_standard.pdf
Find out more about the command-line tool by running python confectionary/text2pdf.py -h.
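As a rough picture of how the switches mentioned in this README could map onto the keyword arguments of dir_to_pdf, here is a small argparse sketch. It is not the project's actual parser: only the flag spellings (-i, -o, -r, -kw, -m, --no-split) and the argument names come from this document, while the defaults, help texts, and the helper names build_parser and cli_to_kwargs are assumptions.

```python
import argparse


def build_parser():
    """Build a stand-in parser mirroring the switches named in this README."""
    parser = argparse.ArgumentParser(description="Sketch of a text-to-PDF CLI")
    parser.add_argument("-i", dest="input_dir", required=True,
                        help="directory containing .txt files")
    # Whether -o is required in the real tool is not stated; optional here.
    parser.add_argument("-o", dest="output_dir", default=None,
                        help="directory for the generated PDF")
    parser.add_argument("-r", dest="recurse", action="store_true",
                        help="load files recursively")
    parser.add_argument("-kw", dest="keywords", default=None,
                        help="keywords used to label the document")
    parser.add_argument("-m", dest="word2vec_model", default=None,
                        help="name of the word2vec model to use")
    parser.add_argument("--no-split", dest="do_paragraph_splitting",
                        action="store_false",
                        help="disable paragraph splitting")
    return parser


def cli_to_kwargs(argv):
    """Parse a list of CLI tokens into a dict of keyword arguments."""
    return vars(build_parser().parse_args(argv))
```

With the package installed, the resulting dictionary could then be filtered for None values and passed along as dir_to_pdf(**kwargs).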
Three basic functions are available in confectionary.text2pdf: dir_to_pdf, file_to_pdf, and str_to_pdf:
- dir_to_pdf takes a directory path and creates a PDF from all .txt files in the directory.
- file_to_pdf takes a file path and creates a PDF from the file.
- str_to_pdf takes a string and creates a PDF from the string.
Details on the function arguments can be found in the relevant function docstrings (or by calling help() on the function). To replicate the above command-line usage, use the following code:
from confectionary.text2pdf import dir_to_pdf
report_path = dir_to_pdf(
    input_dir="/path/to/input/dir",
    output_dir="/path/to/output/dir",
    keywords="my keywords to label this document",
)
print(f"Report saved to {report_path}")
Check out the dir_to_pdf docstring for more options:
import inspect
from confectionary.text2pdf import dir_to_pdf
# getdoc() only returns the docstring, so print it to see it in a script
print(inspect.getdoc(dir_to_pdf))
Splitting input text into paragraphs is enabled by default and uses a word2vec model. If the model is not available locally, it is downloaded via gensim's API and saved to the ./models directory.
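The download-once behavior described above can be sketched as a small caching helper. This is an assumption-laden illustration rather than confectionary's implementation: the function name get_word2vec_model, the .kv file suffix, and the injectable download and load_local callables are all hypothetical. The defaults rely on gensim.downloader.load and KeyedVectors.load, which are part of gensim.

```python
from pathlib import Path


def get_word2vec_model(name, models_dir="./models", download=None, load_local=None):
    """Return a word2vec model, reusing a local copy when one exists.

    Hypothetical helper; the on-disk layout (<models_dir>/<name>.kv) is an
    assumption made for this sketch, not the layout used by confectionary.
    """
    if download is None or load_local is None:
        # Import gensim lazily so the helper can be exercised with stand-ins.
        import gensim.downloader
        from gensim.models import KeyedVectors

        download = download or gensim.downloader.load
        load_local = load_local or KeyedVectors.load

    path = Path(models_dir) / f"{name}.kv"
    if path.exists():
        # A local copy exists, so skip the download entirely.
        return load_local(str(path))

    model = download(name)  # fetches the model via the gensim-data API
    path.parent.mkdir(parents=True, exist_ok=True)
    model.save(str(path))  # cache the model for later runs
    return model
```

On the first call the model is downloaded and saved; later calls with the same name and directory load the saved copy instead.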
- A specific model can be selected in the dir_to_pdf function via the word2vec_model argument.
- The default model is glove-wiki-gigaword-100, a 100-dimensional model with a download size of ~130 MB.
- Model API information can be printed by passing the --api-info flag to the command-line tool or by calling the confectionary.utils.print_api_info() function.
- Paragraph splitting can be disabled by setting the do_paragraph_splitting parameter to False or, in command-line mode, by adding the --no-split switch.

text2pdf.py script to a module/function

Apache License 2.0