Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
APACHE-2.0 License
pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
.metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate that produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count without preserving all text.
filetype was incorrectly identified as a MSG file.
Published by christinestraub 2 months ago
Published by MthwRobinson 2 months ago
nltk version to resolve CVE.
ingest-test-fixture-update-pr to resolve NLTK model download errors.
TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
Published by MthwRobinson 2 months ago
unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.
Published by christinestraub 2 months ago
pytesseract>=0.3.12 that occurred during pip install unstructured[pdf]==0.15.3.
Published by christinestraub 2 months ago
extra-paddleocr.in to resolve the error in the setup.py configuration.
Published by MthwRobinson 2 months ago
figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.
nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. The nltk==3.8.2 release patches CVE-2024-39705.
partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
image/jpg in PPTX as alias for image/jpeg. Resolves a problem partitioning PPTX files that have an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
Published by christinestraub 3 months ago
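The single-column CSV case noted in this release can be illustrated with the standard library. This is a minimal sketch, not the code partition_csv() uses: the helper name sniff_delimiter, the candidate delimiter set, and the comma default are all assumptions made for the example.

```python
import csv


def sniff_delimiter(sample: str, candidates: str = ",;\t|", default: str = ",") -> str:
    """Guess the delimiter of a CSV sample, falling back to a default.

    csv.Sniffer raises csv.Error when no candidate delimiter appears
    consistently, which is exactly what happens for a single-column file.
    """
    try:
        # Restricting the candidates stops the sniffer from picking a letter
        # that happens to occur once on every line.
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        # Hypothetical fallback: treat the file as comma-delimited.
        return default
```

With this fallback a single-column sample parses as one field per row instead of raising.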
pdfminer embedded image extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in pdf partitioning.
Recipient elements are generated for cc and bcc when include_headers=True for email partitioning.
pdf_hi_res_max_pages argument for partitioning, which allows rejecting PDF files that exceed this page number limit when the hi_res strategy is chosen. By default, it will allow parsing PDF files with an unlimited number of pages.
HuggingFaceEmbeddingEncoder to use HuggingFaceEmbeddings from langchain_huggingface package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
OpenAIEmbeddingEncoder to use OpenAIEmbeddings from langchain-openai package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
detect_filetype() no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead FileNotFoundError is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
partition() as a file-path was identified as TXT and partitioned using partition_text(). EML files specified by path are now identified and processed correctly, including processing any attachments.
partition() would raise when gzip compression was used for transport by the server.
partition() with a swapped MS-Office content_type would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by partition() is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
Published by christinestraub 3 months ago
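The MS-Office content-type check described in this release can be sketched as follows. This is an illustration under stated assumptions, not the logic partition() runs: the function name corrected_office_mime is hypothetical, and the sketch assumes each package type can be recognized by its conventional main part inside the Zip archive.

```python
import io
import zipfile

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
PPTX = "application/vnd.openxmlformats-officedocument.presentationml.presentation"
XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"

# Assumption: the conventional main part identifies the package type.
_MAIN_PART_TO_MIME = {
    "word/document.xml": DOCX,
    "ppt/presentation.xml": PPTX,
    "xl/workbook.xml": XLSX,
}


def corrected_office_mime(data: bytes, claimed: str) -> str:
    """Return the claimed MIME type, corrected when the archive disagrees."""
    if claimed not in _MAIN_PART_TO_MIME.values():
        return claimed  # only MS-Office 2007+ types are checked
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return claimed  # not a Zip archive, nothing to compare against
    for member, mime in _MAIN_PART_TO_MIME.items():
        if member in names:
            return mime
    return claimed
```

A spreadsheet sent with the DOCX content type would come back as the XLSX type, so the right partitioner can be chosen.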
=\n and =\r\n characters during the clearing process. Previously, only =\n characters were removed.
<p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
Published by MthwRobinson 3 months ago
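The =\n and =\r\n sequences mentioned in this release are quoted-printable soft line breaks: an equals sign followed by a line ending. A minimal sketch of removing both variants is shown below; the function name remove_soft_line_breaks is hypothetical and this is not the library's own cleaning code.

```python
import re

# "=" followed by either a Unix (\n) or a Windows (\r\n) line ending.
_SOFT_LINE_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Join lines that were split by quoted-printable soft line breaks."""
    return _SOFT_LINE_BREAK.sub("", text)
```

Matching the optional \r keeps a stray carriage return from being left behind in text that originated on Windows.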
.doc files are now supported in the arm64 image. libreoffice24 is added to the arm64 image, meaning .doc files are now supported. We have follow-on work planned to investigate adding .ppt support for arm64 as well.
nltk.download in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705.
Published by MthwRobinson 4 months ago
hi_res strategy the analysis parameter can be used to visualize the result of the OD model and dump the result to a file. Additionally, the visualization of bounding boxes of each layout source is rendered and saved for each page.
partition_docx() distinguishes "file not found" from "not a ZIP archive" error. partition_docx() now provides different error messages for "file not found" and "file is not a ZIP archive (and therefore not a DOCX file)". This aids diagnosis since these two conditions generally point in different directions as to the cause and fix.
soffice processes could be attempted. Add a wait mechanism in convert_office_doc so that the function first checks whether another soffice process is already running; if so, it waits until that process finishes or the wait timeout elapses before spawning a subprocess to run soffice.
partition() now forwards strategy arg to partition_docx(), partition_pptx(), and their brokering partitioners for DOC, ODT, and PPT formats. A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_pptx(), and their brokering partitioners when those filetypes are detected.
Published by MthwRobinson 4 months ago
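The wait mechanism described for convert_office_doc in this release amounts to polling with a deadline. The sketch below shows that pattern only; the name wait_until_idle, its parameters, and the is_running callback are assumptions for illustration, and how the real function detects a running soffice process is not shown in these notes.

```python
import time
from typing import Callable


def wait_until_idle(
    is_running: Callable[[], bool],
    timeout: float = 10.0,
    poll_interval: float = 0.1,
) -> bool:
    """Block until is_running() is False or the timeout elapses.

    Returns True if the other process finished, False if the wait timed out.
    Either way the caller may then go on to spawn its own subprocess.
    """
    deadline = time.monotonic() + timeout
    while is_running():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

Using time.monotonic() rather than wall-clock time keeps the deadline stable if the system clock is adjusted during the wait.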
arm64 image now runs on wolfi-base. The arm64 build for wolfi-base does not yet include libreoffice, and so arm64 does not currently support processing .doc, .ppt, or .xls files. If you need to process those files on arm64, use the legacy rockylinux image.
Bump unstructured-inference==0.7.36. Fix ValueError when converting cells to html.
partition() now forwards strategy arg to partition_docx(), partition_ppt(), and partition_pptx(). A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_ppt(), and partition_pptx() when those filetypes are detected.
Fix missing sensitive field markers for embedders.
Published by MthwRobinson 4 months ago
wolfi-base image. The amd64 image now pulls from the unstructured wolfi-base image to avoid duplication of dependency setup steps.
extract_image_block_types and starting_page_number.
Published by christinestraub 4 months ago
overwrite_schema kwarg from Delta Table connector. The overwrite_schema kwarg is deprecated in deltalake>=0.18.0. schema_mode= should be used now instead. schema_mode="overwrite" is equivalent to overwrite_schema=True and schema_mode="merge" is equivalent to overwrite_schema=False. schema_mode defaults to None. You can also now specify engine, which defaults to "pyarrow". You need to specify engine="rust" to use schema_mode.
partition_via_api
Published by MthwRobinson 4 months ago
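The overwrite_schema to schema_mode migration described in this release can be summarized as a small translation step. The helper below is hypothetical and encodes only the equivalences stated in the note; it does not call deltalake and has not been checked against that library's behavior.

```python
from typing import Any, Dict, Optional


def translate_overwrite_schema(overwrite_schema: Optional[bool]) -> Dict[str, Any]:
    """Map the deprecated overwrite_schema flag to the newer keyword arguments.

    Per the release note: True -> schema_mode="overwrite",
    False -> schema_mode="merge", and schema_mode requires engine="rust".
    """
    if overwrite_schema is None:
        # Nothing requested: keep the defaults named in the note.
        return {"schema_mode": None, "engine": "pyarrow"}
    return {
        "schema_mode": "overwrite" if overwrite_schema else "merge",
        "engine": "rust",
    }
```

The returned dictionary could be unpacked into a write call, assuming that call accepts schema_mode and engine as the note describes.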
.tar.gz files. This was added to the Python tarfile lib in Python 3.12. The change only applies when using Python 3.12 and above.
python-oxmsg for partition_msg(). Outlook MSG emails are now partitioned using the python-oxmsg package which resolves some shortcomings of the prior MSG parser.
partition_msg() is now able to parse non-unicode Outlook MSG emails.
partition_msg() is now able to extract attachments without corruption.
Published by christinestraub 5 months ago
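The .tar.gz item in this release refers to a tarfile feature that the note ties to Python 3.12. A minimal sketch of a version-gated extraction follows, assuming the feature in question is the filter argument of extractall and assuming the "data" filter; the note does not say which filter the project uses, and the function name extract_tar_gz is hypothetical.

```python
import sys
import tarfile


def extract_tar_gz(archive_path: str, dest_dir: str) -> None:
    """Extract a .tar.gz archive, using an extraction filter where available."""
    with tarfile.open(archive_path, "r:gz") as tar:
        if sys.version_info >= (3, 12):
            # Assumption: "data" is the filter of interest. It rejects
            # members such as absolute paths or links outside dest_dir.
            tar.extractall(dest_dir, filter="data")
        else:
            # Older interpreters: fall back to the unfiltered call.
            tar.extractall(dest_dir)
```

Gating on the interpreter version mirrors the note's statement that the change only applies on Python 3.12 and above.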
UnstructuredTableTransformerModel. When a table is not recognized, the element.metadata.text_as_html attribute is set to an empty string.
python-docx. Pinned python-docx version to ensure a particular method unstructured uses is included.
Published by christinestraub 5 months ago
category field from Text class to Element class.
partition_docx() now supports pluggable picture sub-partitioners. A sub-partitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
resolve_entities=False for XML parsing with lxml.
form_extraction_skip_tables argument to the partition_pdf_or_image call.
table_as_cells output by default to reduce overhead in partition; now table_as_cells is only produced when the env EXTRACT_TABLE_AS_CELLS is true.
document_to_element_list for handling HTMLDocument. Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block.
Published by christinestraub 5 months ago
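The getattr(element, "type", "") idiom from this release can be shown in isolation. The two classes below are stand-ins invented for the example, not the library's element classes, and element_type is a hypothetical helper.

```python
class PlainElement:
    """Stand-in for an element that has no `type` attribute."""


class HtmlElement:
    """Stand-in for an HTML element that carries a `type` attribute."""

    type = "Title"


def element_type(element: object) -> str:
    # getattr with a default handles only the "attribute is missing" case.
    # A broad try/except AttributeError around a larger block would also
    # hide unrelated attribute errors raised inside that block.
    return getattr(element, "type", "")
```

Because the default applies to one attribute lookup only, any other AttributeError in the surrounding code still surfaces.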
pinecone connector.
Published by christinestraub 5 months ago
unstructured-inference to unstructured.
pinecone connector
unstructured now works with Python 3.12!
Published by christinestraub 5 months ago
partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
NotImplementedError.
paragraph_grouper can be set to False, but the type hint did not reflect this previously.
links is extracted during partitioning and is not needed as a parameter in partition_pdf.
partition_csv() would raise on CSV files with very long lines.
partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
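The raw-string change for regex patterns noted above can be illustrated briefly. In a regular string literal, a sequence such as \d is an invalid string escape that newer Python versions flag, while a raw string passes the backslash to the regex engine unchanged. The pattern and the function name find_versions are made up for this example.

```python
import re

# Raw string: "\d" and "\." reach the regex engine exactly as written,
# so the interpreter has no string escapes to complain about.
VERSION_RE = re.compile(r"\d+\.\d+\.\d+")


def find_versions(text: str) -> list:
    """Return every dotted three-part version number found in text."""
    return VERSION_RE.findall(text)
```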