textract

extract text from any document. no muss. no fuss.

MIT License

Downloads: 300.2K
Stars: 3.8K
Committers: 38

textract - v1.6.4 Latest Release

Published by deanmalmgren about 3 years ago

Several updates. See changelog for details

textract - v1.6.3

Published by jpweytjens about 5 years ago

fix the msg parser and update the Travis CI build

textract - v1.6.2

Published by jpweytjens over 5 years ago

update dependencies and make pocketsphinx optional

textract - v1.6.1

Published by deanmalmgren over 7 years ago

documentation build fixes

textract - v1.6.0

Published by deanmalmgren over 7 years ago

psv/tsv parsers, user-provided filename extensions, audio parsing with pocketsphinx, and several other bug fixes

textract - v1.5.0

Published by deanmalmgren almost 8 years ago

Python 3 compatibility, improved docx extraction, improved image extraction, and more.

textract - v1.4.0

Published by deanmalmgren about 9 years ago

pdf layout preservation, extensionless file support, and several 🐛 fixes

textract - v1.3.0

Published by deanmalmgren over 9 years ago

Added .rtf and .msg support

textract - v1.2.0

Published by deanmalmgren over 9 years ago

Includes support for tiff files and a new --option/-O command line option to pass in arbitrary keyword arguments to parsers, like the language for tesseract OCR
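
As a rough illustration of the idea behind an --option/-O flag, the sketch below collects repeated KEY=VALUE pairs with argparse and forwards them as keyword arguments to a parser function. It is not textract's actual implementation: the KEY=VALUE format, the program name `extract-sketch`, and the functions `parse_key_value`, `build_parser`, `fake_ocr_parser`, and `main` are all assumptions made for this example.

```python
import argparse


def parse_key_value(text):
    """Split a KEY=VALUE string into a (key, value) pair."""
    key, sep, value = text.partition("=")
    if not sep or not key:
        raise argparse.ArgumentTypeError(f"expected KEY=VALUE, got {text!r}")
    return key, value


def build_parser():
    """Build a small CLI with a repeatable -O/--option flag (hypothetical)."""
    parser = argparse.ArgumentParser(prog="extract-sketch")
    parser.add_argument("filename")
    # Each -O/--option occurrence appends one (key, value) pair to the list.
    parser.add_argument(
        "-O", "--option",
        dest="options",
        type=parse_key_value,
        action="append",
        default=[],
        metavar="KEY=VALUE",
    )
    return parser


def fake_ocr_parser(filename, language="eng", **kwargs):
    """Hypothetical stand-in for a parser that accepts keyword arguments."""
    return f"{filename} parsed with language={language}"


def main(argv):
    args = build_parser().parse_args(argv)
    # Turn the collected pairs into keyword arguments for the parser.
    return fake_ocr_parser(args.filename, **dict(args.options))
```

Calling `main(["scan.tiff", "-O", "language=nor"])` passes `language="nor"` through to the stand-in parser, while omitting the flag leaves the parser's default in place.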

textract - v1.1.0

Published by deanmalmgren about 10 years ago

support for a variety of formats, including audio (.wav, .mp3, .ogg), csv, scanned pdfs, and htm plus various bug fixes and internal improvements.

textract - v1.0.0

Published by deanmalmgren about 10 years ago

Bump in major release comes from a standardization of the byte-string output of textract. This also includes support for spreadsheets (.xls, .xlsx) and e-publications (.epub)

textract - v0.5.1

Published by deanmalmgren about 10 years ago

Fixed a few bugs and re-released.

textract - v0.5.0

Published by deanmalmgren about 10 years ago

Support for .json, .odt, .ps, .gif, .jpg, .jpeg, and .png files

textract - v0.3.0

Published by deanmalmgren about 10 years ago

Bug fixes and support for .txt (haha)

textract - v0.4.0

Published by deanmalmgren about 10 years ago

Includes support for .html and .eml

textract - v0.2.0

Published by deanmalmgren over 10 years ago