extract text from any document. no muss. no fuss.
MIT License
Several updates. See changelog for details
Published by jpweytjens about 5 years ago
fix the msg parser and update the Travis CI build
Published by jpweytjens over 5 years ago
update dependencies and make pocketsphinx optional
Published by deanmalmgren over 7 years ago
documentation build fixes
Published by deanmalmgren over 7 years ago
psv/tsv parsers, user-provided filename extensions, audio parsing with pocketsphinx, and several other bug fixes
Published by deanmalmgren almost 8 years ago
python 3 compatibility, improved docx extraction, improved image extraction, and more.
Published by deanmalmgren about 9 years ago
pdf layout preservation, extensionless file support, and several 🐛 fixes
Published by deanmalmgren over 9 years ago
Added .rtf and .msg support
Published by deanmalmgren over 9 years ago
Includes support for tiff files and a new --option/-O command line option to pass in arbitrary keyword arguments to parsers, like the language for tesseract OCR
Published by deanmalmgren about 10 years ago
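As a rough illustration of the --option/-O idea described in the release above, the sketch below shows one way repeated KEY=VALUE flags can be collected into a dictionary of keyword arguments. It uses only the standard library and is not the project's actual implementation: the names parse_option, build_parser, and parse_args, the program name extract-demo, and the KEY=VALUE syntax itself are assumptions made for illustration.

```python
import argparse


def parse_option(pair):
    """Split a 'key=value' string into a (key, value) tuple (hypothetical helper)."""
    key, sep, value = pair.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {pair!r}")
    return key, value


def build_parser():
    """Build a demo command-line parser; 'extract-demo' is a made-up program name."""
    parser = argparse.ArgumentParser(prog="extract-demo")
    parser.add_argument("filename")
    # Each -O/--option flag appends one (key, value) pair to args.options.
    parser.add_argument(
        "-O", "--option",
        dest="options",
        type=parse_option,
        action="append",
        default=[],
        metavar="KEY=VALUE",
    )
    return parser


def parse_args(argv):
    """Return the filename and a dict of arbitrary keyword arguments."""
    args = build_parser().parse_args(argv)
    return args.filename, dict(args.options)


# Example: ask for Norwegian when a scanned TIFF is sent to an OCR parser.
filename, kwargs = parse_args(["scan.tiff", "-O", "language=nor"])
```

A dictionary built this way could then be forwarded to whichever parser handles the file, for example as **kwargs, so that new parser settings need no dedicated command-line flag.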
support for a variety of formats, including audio (.wav, .mp3, .ogg), csv, scanned pdfs, and htm plus various bug fixes and internal improvements.
Published by deanmalmgren about 10 years ago
Bump in major release comes from a standardization of the byte-string output of textract. This also includes support for spreadsheets (.xls, .xlsx) and e-publications (.epub)
Published by deanmalmgren about 10 years ago
Fixed a few bugs and re-released.
Published by deanmalmgren about 10 years ago
Support for .json, .odt, .ps, .gif, .jpg, .jpeg, and .png files
Published by deanmalmgren about 10 years ago
Bug fixes and support for .txt (haha)
Published by deanmalmgren about 10 years ago
Includes support for .html and .eml
Published by deanmalmgren over 10 years ago