unstructured | Langchain Ecosystem Directory


unstructured - 0.15.8 Latest Release

Published by MthwRobinson about 2 months ago

0.15.8

Enhancements

Bump unstructured.paddleocr to 2.8.1.0.

Features

Add MixedbreadAI embedder. Adds MixedbreadAI embeddings to support embedding via Mixedbread AI.

Fixes

Replace pillow-heif with pi-heif. Replaces pillow-heif with pi-heif due to more permissive licensing on the wheel for pi-heif.
Minify text_as_html from DOCX. Previously .metadata.text_as_html for DOCX tables was "bloated" with whitespace and noise elements introduced by tabulate, which produced over-chunking and lower "semantic density" of elements. Reduce HTML to minimum character count while preserving all text.
Fall back to filename extension-based file-type detection for unidentified OLE files. Resolves a problem where a DOC file that could not be detected as such by filetype was incorrectly identified as a MSG file.
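The OLE fallback described above can be pictured with a small sketch. This is not the library's detection code: `guess_ole_file_type`, `EXTENSION_TYPES`, and the `sniffed` argument are hypothetical names used only for illustration. The one hard fact it relies on is the 8-byte signature shared by OLE2 compound files, which is why DOC, PPT, XLS, and MSG files cannot be told apart from the header alone.

```python
from pathlib import Path
from typing import Optional

# Signature shared by all OLE2 compound files (legacy DOC, PPT, XLS, and MSG).
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Hypothetical extension map, used only for this illustration.
EXTENSION_TYPES = {".doc": "DOC", ".ppt": "PPT", ".xls": "XLS", ".msg": "MSG"}


def guess_ole_file_type(
    head: bytes, filename: str, sniffed: Optional[str] = None
) -> Optional[str]:
    """Return a file-type label for an OLE container, or None if it is not OLE.

    `sniffed` stands in for the result of a content-based detector. When that
    detector could not identify the file, fall back to the filename extension
    instead of assuming a particular OLE flavor such as MSG.
    """
    if not head.startswith(OLE_MAGIC):
        return None
    if sniffed is not None:
        return sniffed
    return EXTENSION_TYPES.get(Path(filename).suffix.lower(), "UNKNOWN")
```

With this shape, an unidentified OLE file named `report.doc` is labeled DOC rather than being treated as MSG by default.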

unstructured - 0.15.7

Published by christinestraub 2 months ago

0.15.7

Enhancements

Features

Fixes

Fix NLTK data download path to prevent nested directories. Resolved an issue where a nested "nltk_data" directory was created within the parent "nltk_data" directory when it already existed. This fix prevents errors in checking for existing downloads and loading models from NLTK data.
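A minimal sketch of the kind of path handling that avoids the nested directory, assuming the download target is derived from a base path. `resolve_nltk_data_dir` is a hypothetical helper, not a function from unstructured or NLTK.

```python
from pathlib import Path


def resolve_nltk_data_dir(base: Path) -> Path:
    """Return the download directory without producing nltk_data/nltk_data."""
    # If the caller already points at an "nltk_data" directory, reuse it as-is.
    if base.name == "nltk_data":
        return base
    # Otherwise place the data in a single "nltk_data" child directory.
    return base / "nltk_data"
```

Both `Path("/home/user")` and `Path("/home/user/nltk_data")` resolve to the same directory, so later checks for existing downloads look in one place.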

unstructured - 0.15.6

Published by MthwRobinson 2 months ago

0.15.6

Enhancements

Features

Fixes

Bump to NLTK 3.9.x. Bumps to the latest nltk version to resolve a CVE.
Update CI for ingest-test-fixture-update-pr to resolve NLTK model download errors.
Synchronized text and html on TableChunk splits. When a Table element is divided during chunking to fit the chunking window, TableChunk.text corresponds exactly with the table text in TableChunk.metadata.text_as_html, .text_as_html is always parseable HTML, and the table is split on even row boundaries whenever possible.
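The TableChunk behavior above (text and HTML built from the same rows, splits on row boundaries) can be illustrated with a simplified sketch. It is an assumption-laden toy, not the chunker's implementation: `split_table`, `rows_to_text`, and `rows_to_html` are hypothetical helpers, and the window is measured on the plain-text length only.

```python
from html import escape
from typing import Iterator, List, Tuple

Row = List[str]


def rows_to_text(rows: List[Row]) -> str:
    """Plain-text form: cells separated by spaces, rows by newlines."""
    return "\n".join(" ".join(row) for row in rows)


def rows_to_html(rows: List[Row]) -> str:
    """Minimal, always-parseable HTML built from the same rows as the text."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"


def split_table(rows: List[Row], max_chars: int) -> Iterator[Tuple[str, str]]:
    """Yield (text, html) chunks, splitting only between whole rows.

    A single row longer than max_chars is emitted on its own; this sketch
    does not attempt to split inside a row.
    """
    batch: List[Row] = []
    for row in rows:
        candidate = batch + [row]
        if batch and len(rows_to_text(candidate)) > max_chars:
            yield rows_to_text(batch), rows_to_html(batch)
            batch = [row]
        else:
            batch = candidate
    if batch:
        yield rows_to_text(batch), rows_to_html(batch)
```

Because each chunk's text and HTML are rendered from the same list of rows, they cannot drift apart the way independently truncated strings can.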

unstructured - 0.15.5

Published by MthwRobinson 2 months ago

0.15.5

Enhancements

Features

Fixes

Revert to using unstructured.pytesseract fork. Due to the unavailability of some recent release versions of pytesseract on PyPI, the project now uses the unstructured.pytesseract fork to ensure stability and continued support.
Bump libreoffice version in image. Bumps the libreoffice version to 25.2.5.2 to address CVEs.
Downgrade NLTK dependency version for compatibility. Due to the unavailability of nltk==3.8.2 on PyPI, the NLTK dependency has been downgraded to <3.8.2. This change ensures continued functionality and compatibility.

unstructured - 0.15.4

Published by christinestraub 2 months ago

0.15.4

Enhancements

Features

Fixes

Resolve an installation error with pytesseract>=0.3.12 that occurred during pip install unstructured[pdf]==0.15.3.

unstructured - 0.15.3

Published by christinestraub 2 months ago

0.15.3

Enhancements

Features

Fixes

Remove the custom index URL from extra-paddleocr.in to resolve the error in the setup.py configuration.

unstructured - 0.15.2

Published by MthwRobinson 2 months ago

0.15.2

Enhancements

Improve directory handling when extracting image blocks. The figures directory is no longer created when the extract_image_block_to_payload parameter is set to True.

Features

Added per-class Object Detection metrics in the evaluation. The metrics include average precision, precision, recall, and f1-score for each class in the dataset.
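For readers unfamiliar with the per-class metrics named above, here is a minimal sketch of precision, recall, and f1 computed from per-class counts. It assumes true-positive, false-positive, and false-negative counts are already available; `per_class_scores` is a hypothetical helper, and average precision is left out because it needs ranked confidence scores.

```python
from typing import Dict


def per_class_scores(counts: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, float]]:
    """Compute precision, recall, and f1 for each class label.

    `counts` maps a class label to its "tp", "fp", and "fn" counts.
    Classes with no predictions or no ground truth score 0.0.
    """
    scores: Dict[str, Dict[str, float]] = {}
    for label, c in counts.items():
        tp, fp, fn = c["tp"], c["fp"], c["fn"]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```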

Fixes

Updates NLTK data file for compatibility with nltk>=3.8.2. The NLTK data file now contains punkt_tab, making it possible to upgrade to nltk>=3.8.2. nltk==3.8.2 patches CVE-2024-39705.
Renames Astra to Astra DB. Conforms with DataStax internal naming conventions.
Accommodate single-column CSV files. Resolves a limitation of partition_csv() where delimiter detection would fail on a single-column CSV file (which naturally has no delimiters).
Accommodate image/jpg in PPTX as alias for image/jpeg. Resolves problem partitioning PPTX files having an invalid image/jpg (should be image/jpeg) MIME-type in the [Content_Types].xml member of the PPTX Zip archive.
Fixes an issue in Object Detection metrics. The issue was in preprocessing/validating the ground truth and predicted data for object detection metrics.
Removes dependency on unstructured.pytesseract. Unstructured forked pytesseract while waiting for code to be upstreamed. Now that the new version has been released, this fork can be removed.
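The single-column CSV fix listed above addresses a general pitfall: delimiter sniffing has nothing to latch onto when no delimiter is present. A minimal sketch using the standard library's `csv.Sniffer`, assuming a default of `,` is acceptable; `detect_delimiter` and `CANDIDATE_DELIMITERS` are illustrative names, not partition_csv() internals.

```python
import csv

# Delimiters considered by this sketch; an assumption, not the library's list.
CANDIDATE_DELIMITERS = ",;\t|"


def detect_delimiter(sample: str, default: str = ",") -> str:
    """Guess the delimiter of a CSV sample, tolerating single-column input."""
    # A single-column file contains none of the candidates, so skip sniffing.
    if not any(d in sample for d in CANDIDATE_DELIMITERS):
        return default
    try:
        return csv.Sniffer().sniff(sample, delimiters=CANDIDATE_DELIMITERS).delimiter
    except csv.Error:
        # Sniffing can still fail on irregular data; fall back to the default.
        return default
```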

unstructured - 0.15.1

Published by christinestraub 3 months ago

0.15.1

Enhancements

Improve pdfminer embedded image extraction to exclude text elements and produce more accurate bounding boxes. This results in cleaner, more precise element extraction in pdf partitioning.

Features

Update partition_eml and partition_msg to capture cc, bcc, and message_id fields. Cc, bcc, and message_id information is captured in element metadata for both msg and email partitioning, and Recipient elements are generated for cc and bcc when include_headers=True for email partitioning.
Mark ingest as deprecated. Begin sunset of ingest code in this repo as it's been moved to a dedicated repo.
Add pdf_hi_res_max_pages argument for partitioning, which allows rejecting PDF files that exceed this page limit when the hi_res strategy is chosen. By default, PDF files with an unlimited number of pages are allowed.

Fixes

Update HuggingFaceEmbeddingEncoder to use HuggingFaceEmbeddings from langchain_huggingface package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
Update OpenAIEmbeddingEncoder to use OpenAIEmbeddings from langchain-openai package instead of the deprecated version from langchain-community. This resolves the deprecation warning and ensures compatibility with future versions of langchain.
Update import of Pinecone exception. Adds compatibility for pinecone-client>=5.0.0.
File-type detection catches non-existent file-path. detect_filetype() no longer silently falls back to detecting a file-type based on the extension when no file exists at the path provided. Instead FileNotFoundError is raised. This provides consistent user notification of a mis-typed path rather than an unpredictable exception from a file-type specific partitioner when the file cannot be opened.
EML files specified as a file-path are detected correctly. Resolved a bug where an EML file submitted to partition() as a file-path was identified as TXT and partitioned using partition_text(). EML files specified by path are now identified and processed correctly, including processing any attachments.
A DOCX, PPTX, or XLSX file specified by path and ambiguously identified as MIME-type "application/octet-stream" is identified correctly. Resolves a shortcoming where a file specified by path immediately fell back to filename-extension based identification when misidentified as "application/octet-stream", either by asserted content type or a mis-guess by libmagic. An MS Office file misidentified in this way is now correctly identified regardless of its filename and whether it is specified by path or file-like object.
Textual content retrieved from a URL with gzip transport compression now partitions correctly. Resolves a bug where a textual file-type (such as Markdown) retrieved by passing a URL to partition() would raise when gzip compression was used for transport by the server.
A DOCX, PPTX, or XLSX content-type asserted on partition is confirmed or fixed. Resolves a bug where calling partition() with a swapped MS-Office content_type would cause the file-type to be misidentified. A DOCX, PPTX, or XLSX MIME-type received by partition() is now checked for accuracy and corrected if the file is for a different MS-Office 2007+ type.
DOC, PPT, XLS, and MSG files are now auto-detected correctly. Resolves a bug where DOC, PPT, and XLS files were auto-detected as MSG files under certain circumstances.
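Several of the fixes above come down to inspecting the file's content rather than trusting an asserted MIME-type or filename. A rough sketch of that idea for MS-Office 2007+ files, which are Zip archives: it looks for the conventional main-part name of each package type. `sniff_office_type` and `MARKER_PARTS` are hypothetical names, and real detection may consult the package relationships rather than fixed paths.

```python
import io
import zipfile
from typing import BinaryIO, Optional

# Conventional main-part paths for the three package types (an assumption
# for illustration; a package may legally place its main part elsewhere).
MARKER_PARTS = {
    "word/document.xml": "DOCX",
    "ppt/presentation.xml": "PPTX",
    "xl/workbook.xml": "XLSX",
}


def sniff_office_type(file: BinaryIO) -> Optional[str]:
    """Return "DOCX", "PPTX", or "XLSX" based on archive members, else None."""
    if not zipfile.is_zipfile(file):
        return None
    file.seek(0)
    with zipfile.ZipFile(file) as archive:
        names = set(archive.namelist())
    for part, file_type in MARKER_PARTS.items():
        if part in names:
            return file_type
    return None
```

A caller could compare this result with the asserted content type and prefer the sniffed value when the two disagree.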

unstructured - 0.15.0

Published by christinestraub 3 months ago

0.15.0

Enhancements

Improve text clearing process in email partitioning. Updated the email partitioner to remove both =\n and =\r\n characters during the clearing process. Previously, only =\n characters were removed.
Bump unstructured.paddleocr to 2.8.0.1.
Refine HTML parser to accommodate block element nested in phrasing. HTML parser no longer raises on a block element (e.g. <p>, <div>) nested inside a phrasing element (e.g. <strong> or <cite>). Instead it breaks the phrasing run (and therefore element) at the block-item start and begins a new phrasing run after the block-item. This is consistent with how the browser determines element boundaries in this situation.
Install rewritten HTML parser to fix 12 existing bugs and provide headroom for refinement and growth. A rewritten HTML parser resolves a collection of outstanding bugs with HTML partitioning and provides a firm foundation for further elaborating that important partitioner.
CI check for dependency licenses. Adds a CI check to ensure dependencies are appropriately licensed.
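The email clearing change above concerns quoted-printable soft line breaks, where a trailing `=` marks a line that continues on the next one. A minimal sketch that handles both `=\n` and `=\r\n`; `remove_soft_line_breaks` is an illustrative name, not the partitioner's function.

```python
import re

# "=" followed by LF or CRLF is a soft line break in quoted-printable text.
_SOFT_BREAK = re.compile(r"=\r?\n")


def remove_soft_line_breaks(text: str) -> str:
    """Join lines that were wrapped with quoted-printable soft breaks."""
    return _SOFT_BREAK.sub("", text)
```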

Features

Add support for specifying OCR language to partition_pdf(). Extend language specification capability to PaddleOCR in addition to TesseractOCR. Users can now specify OCR languages for both OCR engines when using partition_pdf().
Add AstraDB source connector. Adds support for ingesting documents from AstraDB.

Fixes

Remedy error on Windows when nltk binaries are downloaded. Work around a quirk in the Windows implementation of tempfile.NamedTemporaryFile where accessing the temporary file by name raises PermissionError.
Move Astra embedded_dimension to write config
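The Windows quirk mentioned in the NLTK fix above is commonly worked around by creating the temporary file with `delete=False`, closing it, and only then reopening it by name. The sketch below shows that general pattern; it is not necessarily the exact workaround used in the library, and `write_then_reopen` is a hypothetical helper.

```python
import os
import tempfile


def write_then_reopen(payload: bytes) -> bytes:
    """Write to a named temp file, then read it back by name, portably."""
    # delete=False keeps the file on disk after close(), so it can be
    # reopened by name; on Windows an open NamedTemporaryFile cannot be.
    tmp = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp.write(payload)
        tmp.close()
        with open(tmp.name, "rb") as reopened:
            return reopened.read()
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)  # manual cleanup replaces delete-on-close
```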

unstructured - 0.14.10

Published by MthwRobinson 3 months ago

0.14.10

Enhancements

Update unstructured-client dependency. Change the unstructured-client dependency pin back to greater than the minimum version and update tests that were failing given the update.
.doc files are now supported in the arm64 image. libreoffice24 is added to the arm64 image, meaning .doc files are now supported. We have follow-on work planned to investigate adding .ppt support for arm64 as well.
Add table detection metrics: recall, precision and f1
Remove unused _with_spans metrics

Features

Fixes

Fix counting false negatives and false positives in table structure evaluation
Fix Slack CI test. Change the channel that the Slack test points to because the previous test bot expired.
Remove NLTK download. Removes nltk.download in favor of downloading from an S3 bucket we host to mitigate CVE-2024-39705.

unstructured - 0.14.9

Published by MthwRobinson 4 months ago

0.14.9

Enhancements

Added visualization and OD model result dump for PDF. In the PDF hi_res strategy, the analysis parameter can be used to visualize the result of the OD model and dump the result to a file. Additionally, the visualization of bounding boxes of each layout source is rendered and saved for each page.
partition_docx() distinguishes "file not found" from "not a ZIP archive" error. partition_docx() now provides different error messages for "file not found" and "file is not a ZIP archive (and therefore not a DOCX file)". This aids diagnosis since these two conditions generally point in different directions as to the cause and fix.

Features

Fixes

Fix a bug where multiple soffice processes could be attempted. Add a wait mechanism in convert_office_doc so that the function first checks whether another soffice process is already running; if so, it waits until that process finishes or the wait timeout elapses before spawning a subprocess to run soffice.
partition() now forwards strategy arg to partition_docx(), partition_pptx(), and their brokering partitioners for DOC, ODT, and PPT formats. A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_pptx(), and their brokering partitioners when those filetypes are detected.
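The soffice wait mechanism described above is a poll-with-timeout loop. The sketch below shows that pattern in isolation, assuming the caller supplies a function that reports whether another process is running. `wait_until_idle` and its parameters are hypothetical and do not mirror convert_office_doc's signature.

```python
import time
from typing import Callable


def wait_until_idle(
    is_busy: Callable[[], bool],
    timeout: float = 10.0,
    poll: float = 0.1,
) -> bool:
    """Block until is_busy() returns False or the timeout elapses.

    Returns True if the resource became idle, False on timeout. Per the
    release note, the caller proceeds to spawn its subprocess either way.
    """
    deadline = time.monotonic() + timeout
    while is_busy():
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
    return True
```

In a real setting `is_busy` might scan the process list for a running soffice; that check is left out because it is platform-specific.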

unstructured - 0.14.8

Published by MthwRobinson 4 months ago

0.14.8

Enhancements

Move arm64 image to wolfi-base. The arm64 image now runs on wolfi-base. The arm64 build for wolfi-base does not yet include libreoffice, so arm64 does not currently support processing .doc, .ppt, or .xls files. If you need to process those files on arm64, use the legacy rockylinux image.

Features

Fixes

Bump unstructured-inference==0.7.36. Fix ValueError when converting cells to html.
partition() now forwards strategy arg to partition_docx(), partition_ppt(), and partition_pptx(). A strategy argument passed to partition() (or the default value "auto" assigned by partition()) is now forwarded to partition_docx(), partition_ppt(), and partition_pptx() when those filetypes are detected.
Fix missing sensitive field markers for embedders

unstructured - 0.14.7

Published by MthwRobinson 4 months ago

0.14.7

Enhancements

Pull from wolfi-base image. The amd64 image now pulls from the unstructured wolfi-base image to avoid duplication of dependency setup steps.
Fix windows temp file. Make the creation of a temp file in unstructured/partition/pdf_image/ocr.py windows compatible.

Features

Expose conversion functions for tables. Adds public functions to convert tables from HTML to the Deckerd format and back.

Fixes

Fix an error publishing docker images. Update user in docker-smoke-test to reflect changes made by the amd64 image pull from the "unstructured" "wolfi-base" image.
Fix an IndexError when partitioning a pdf with values for both extract_image_block_types and starting_page_number.

unstructured - 0.14.6

Published by christinestraub 4 months ago

0.14.6

Enhancements

Bump unstructured-inference==0.7.35. Fix syntax for generated HTML tables.

Features

tqdm ingest support. Add an optional flag to the ingest flow to print a progress bar for each step in the process.

Fixes

Remove deprecated overwrite_schema kwarg from Delta Table connector. The overwrite_schema kwarg is deprecated in deltalake>=0.18.0; schema_mode= should be used instead. schema_mode="overwrite" is equivalent to overwrite_schema=True and schema_mode="merge" is equivalent to overwrite_schema=False. schema_mode defaults to None. You can also now specify engine, which defaults to "pyarrow". You need to specify engine="rust" to use schema_mode.
Fix passing parameters to python-client. List arguments are no longer converted to strings when passed to python-client in the Ingest workflow and partition_via_api.
Table metric bug fix. get_element_level_alignment() now finds all matched indices in the predicted table data instead of returning only the first match when there are multiple matches for the same gt string.
fsspec connector path/permissions bug. V2 fsspec connectors were failing when the defined relative file paths had a leading slash. This strips that slash to guarantee the relative path never has it.
Dropbox connector internal file path bugs. The Dropbox source connector raised exceptions when indexing files due to two issues: a path formatting idiosyncrasy of the Dropbox library and a divergence in the definition of the Dropbox library's fs.info method, which expects a 'url' parameter rather than 'path'.
Update table metric evaluation to handle corrected HTML syntax for tables. This change is connected to the update in unstructured-inference and fixes transforming an HTML table to the deckerd and internal cells formats.
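The Delta Table note above gives a direct mapping from the deprecated overwrite_schema flag to the newer schema_mode and engine settings. A small sketch of that mapping as stated in the note; `delta_write_options` is a hypothetical helper and the returned dictionary is only an illustration of keyword arguments, not a call into deltalake.

```python
from typing import Dict, Optional


def delta_write_options(overwrite_schema: Optional[bool]) -> Dict[str, str]:
    """Translate the legacy flag into schema_mode/engine options.

    Follows the release note: True maps to "overwrite", False to "merge",
    and schema_mode requires engine="rust". With no flag, schema_mode is
    left unset and the default "pyarrow" engine is used.
    """
    if overwrite_schema is None:
        return {"engine": "pyarrow"}
    mode = "overwrite" if overwrite_schema else "merge"
    return {"schema_mode": mode, "engine": "rust"}
```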

unstructured - 0.14.5

Published by MthwRobinson 4 months ago

0.14.5

Enhancements

Filtering for tar extraction. Adds tar filtering to the compression module for connectors to avoid decompressing malicious content in .tar.gz files. This was added to the Python tarfile lib in Python 3.12. The change only applies when using Python 3.12 and above.
Use python-oxmsg for partition_msg(). Outlook MSG emails are now partitioned using the python-oxmsg package which resolves some shortcomings of the prior MSG parser.
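The tar filtering enhancement above relies on the extraction filters in the standard library's tarfile module. A minimal sketch of guarded usage, assuming the built-in "data" filter is the desired policy; `safe_extract` is an illustrative name, and the fallback branch simply preserves the old unfiltered behavior on interpreters that lack the feature.

```python
import tarfile


def safe_extract(archive_path: str, dest: str) -> None:
    """Extract a tar archive, using the "data" filter when it is available."""
    with tarfile.open(archive_path, "r:*") as tar:
        if hasattr(tarfile, "data_filter"):
            # Rejects absolute paths, path traversal, and special files.
            tar.extractall(dest, filter="data")
        else:
            # Older interpreters: no filtering support, legacy behavior.
            tar.extractall(dest)
```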

Features

Fixes

8-bit string Outlook MSG files are parsed. partition_msg() is now able to parse non-unicode Outlook MSG emails.
Attachments to Outlook MSG files are extracted intact. partition_msg() is now able to extract attachments without corruption.

unstructured - 0.14.4

Published by christinestraub 5 months ago

Enhancements

Move logger error to debug level when PDFminer fails to extract text which includes error message for Invalid dictionary construct.
Add support for Pinecone serverless. Adds Pinecone serverless to the connector tests. Pinecone serverless will work with versions >=0.14.2, but hadn't been tested until now.

Features

Allow configuration of the Google Vision API endpoint. Add an environment variable to select the Google Vision API in the US or the EU.

Fixes

Address the issue of unrecognized tables in UnstructuredTableTransformerModel. When a table is not recognized, the element.metadata.text_as_html attribute is set to an empty string.
Remove root handlers in ingest logger. Removes root handlers in ingest loggers to ensure secrets aren't accidentally exposed in Colab notebooks.
Fix V2 S3 Destination Connector authentication. Fixes bugs with S3 Destination Connector where the connection config was neither registered nor properly deserialized.
Clarified dependence on particular version of python-docx. Pinned python-docx version to ensure a particular method unstructured uses is included.
Ingest preserves original file extension. Ingest V2 introduced a change that dropped the original extension for upgraded connectors. This reverts that change.

unstructured - 0.14.3

Published by christinestraub 5 months ago

Enhancements

Move category field from Text class to Element class.
partition_docx() now supports pluggable picture sub-partitioners. A subpartitioner that accepts a DOCX Paragraph and generates elements is now supported. This allows adding a custom sub-partitioner that extracts images and applies OCR or summarization for the image.
Add VoyageAI embedder. Adds VoyageAI embeddings to support embedding via Voyage AI.

Features

Fixes

Fix partition_pdf() to keep spaces in the text. The control character \t is now replaced with a space instead of being removed when merging inferred elements with embedded elements.
Turn off XML resolve entities. Sets resolve_entities=False for XML parsing with lxml to avoid text being dynamically injected into the XML document.
Add backward compatibility for the deprecated pdf_infer_table_structure parameter.
Add the missing form_extraction_skip_tables argument to the partition_pdf_or_image call.
Chromadb change from Add to Upsert using element_id to make idempotent
Disable table_as_cells output by default to reduce overhead in partition; table_as_cells is now only produced when the env EXTRACT_TABLE_AS_CELLS is true.
Reduce excessive logging. Change per-page OCR info-level logging into detail-level trace logging.
Replace try block in document_to_element_list for handling HTMLDocument. Use getattr(element, "type", "") to get the type attribute of an element when it exists. This is a more explicit way to handle the special case for HTML documents and prevents other types of attribute error from being silenced by the try block.
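The getattr change above can be shown in miniature. The classes here are stand-ins invented for the example, not unstructured's element types; the point is only that a default-valued `getattr` confines the "attribute may be missing" case to one lookup instead of wrapping a larger block in try/except.

```python
class TypedElement:
    """Stand-in for an element that carries a `type` attribute."""

    type = "text/html"


class PlainElement:
    """Stand-in for an element without a `type` attribute."""


def element_type(element: object) -> str:
    """Return the element's type, or an empty string when it has none."""
    # Only this single lookup tolerates a missing attribute; errors raised
    # by surrounding code are no longer swallowed by a broad try block.
    return getattr(element, "type", "")
```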

unstructured - 0.14.2

Published by christinestraub 5 months ago

Enhancements

Bump unstructured-inference==0.7.33.

Features

Add attribution to the pinecone connector.

unstructured - 0.14.1

Published by christinestraub 5 months ago

Enhancements

Refactor code related to embedded text extraction. The embedded text extraction code is moved from unstructured-inference to unstructured.

Features

Large improvements to the ingest process:
- Support for multiprocessing and async, with limits for both.
- Streamlined the process of mapping CLI invocations to the underlying code
- More granular steps introduced to give better control over process (i.e. dedicated step to uncompress files already in the local filesystem, new optional staging step before upload)
- Use the python client when calling the unstructured api for partitioning or chunking
- Saving the final content is now a dedicated destination connector (local) set as the default if none are provided. Avoids adding new files locally if uploading elsewhere.
- Leverage last modified date when deciding if new files should be downloaded and reprocessed.
- Add attribution to the pinecone connector
Add support for Python 3.12. unstructured now works with Python 3.12!

unstructured - 0.14.0

Published by christinestraub 5 months ago

0.14.0

BREAKING CHANGES

Turn table extraction for PDFs and images off by default. Reverting the default behavior for table extraction to "off" for PDFs and images. A number of users didn't realize we made the change and were impacted by slower processing times due to the extra model call for table extraction.

Enhancements

Skip unnecessary element sorting in partition_pdf(). Skip element sorting when determining whether embedded text can be extracted.
Faster evaluation. Support for concurrent processing of documents during evaluation.
Add strategy parameter to partition_docx(). Behavior of future enhancements may be sensitive to the partitioning strategy. Add this parameter so partition_docx() is aware of the requested strategy.
Add GLOBAL_WORKING_DIR and GLOBAL_WORKING_PROCESS_DIR configuration parameters to control temporary storage.

Features

Add form extraction basics (document elements and placeholder code in partition). This is to lay the ground work for the future. Form extraction models are not currently available in the library. An attempt to use this functionality will end in a NotImplementedError.

Fixes

Add missing starting_page_num param to partition_image
Make the filename and file params for partition_image and partition_pdf match the other partitioners
Fix include_slide_notes and include_page_breaks params in partition_ppt
Re-apply: skip accuracy calculation feature. Overwritten by mistake.
Fix type hint for paragraph_grouper param. paragraph_grouper can be set to False, but the type hint did not reflect this previously.
Remove links param from partition_pdf. links is extracted during partitioning and is not needed as a parameter in partition_pdf.
Improve CSV delimiter detection. partition_csv() would raise on CSV files with very long lines.
Fix disk-space leak in partition_doc(). Remove temporary file created but not removed when file argument is passed to partition_doc().
Fix possible SyntaxError or SyntaxWarning on regex patterns. Change regex patterns to raw strings to avoid these warnings/errors in Python 3.11+.
Fix disk-space leak in partition_odt(). Remove temporary file created but not removed when file argument is passed to partition_odt().
AstraDB: option to prevent indexing metadata
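The regex fix listed among the 0.14.0 fixes above is about writing patterns as raw strings. In a regular string literal, a sequence such as `\d` is an invalid string escape that newer Python versions warn about; a raw string passes the backslash through to the regex engine untouched. A minimal illustration with an invented pattern (`DIGITS` and `find_numbers` are not from the library):

```python
import re

# Raw string: the backslash reaches the regex engine unchanged, and the
# compiler does not warn about an invalid string escape.
DIGITS = re.compile(r"\d+")


def find_numbers(text: str) -> list:
    """Return every run of digits found in the text."""
    return DIGITS.findall(text)
```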