unstructured | Langchain Ecosystem Directory


unstructured - 0.8.4

Published by Klaijan about 1 year ago

0.8.4

Enhancements

Additional tests and refactor of JSON detection.
Update functionality to retrieve image metadata from a page for document_to_element_list
Links are now tracked in partition_html output.
Set the file's current position to the beginning after reading the file in convert_to_bytes
Add min_partition kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max_partition so words are not split.
Add slide notes to pptx
Add --encoding directive to ingest
Improve JSON detection in detect_filetype
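
The min_partition and max partition behavior listed above can be pictured with a small sketch. This is not the library's implementation: split_on_words and combine_small are hypothetical names, and the sketch assumes sizes are measured in characters.

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters without cutting words."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            # Keep growing the chunk; a single over-long word is kept whole.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_small(pieces: List[str], min_partition: int) -> List[str]:
    """Merge each piece shorter than min_partition into the piece that follows it."""
    combined: List[str] = []
    buffer = ""
    for piece in pieces:
        buffer = f"{buffer} {piece}" if buffer else piece
        if len(buffer) >= min_partition:
            combined.append(buffer)
            buffer = ""
    if buffer:
        # Trailing text that never reached the threshold is kept as its own piece.
        combined.append(buffer)
    return combined
```

With these definitions, split_on_words("the quick brown fox jumps", 10) yields three chunks, none of which breaks a word.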

Features

Adds Outlook connector
Add support for dpi parameter in inference library
Adds OneDrive connector.
Add Confluence connector for ingest CLI to pull the body text from all documents from all spaces in a Confluence domain.

Fixes

Fixes issue with email partitioning where From field was being assigned the To field value.
Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy
Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy
Adds .txt, .text, and .tab to the list of extensions to check if a file has a text/plain MIME type.
Enables filters to be passed to partition_doc so it doesn't error with LibreOffice7.
Removed old error message that's superseded by requires_dependencies.
Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api

unstructured - 0.8.1

Published by rbiseck3 over 1 year ago

0.8.1

Enhancements

Add support for Python 3.11

Features

Fixes

Fixed the auto strategy detecting a scanned document as having extractable text and using the fast strategy, resulting in no output.
Fix list detection in MS Word documents.
Don't instantiate an element with a coordinate system when there isn't a way to get its location data.

unstructured - 0.8.0

Published by rbiseck3 over 1 year ago

Enhancements

Allow the model used for the hi_res PDF partition strategy to be chosen when called.
Updated inference package

Features

Add metadata_filename parameter across all partition functions

Fixes

Adjust encoding recognition threshold value in detect_file_encoding
Fix KeyError when isd_to_elements doesn't find a type
Fix _output_filename for local connector, allowing single files to be written correctly to the disk
Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.
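
For code that has to work on both sides of this change, a small accessor can hide the difference. The sketch below assumes the earlier top-level attribute was named coordinates, which the note above does not state. get_coordinates is a hypothetical helper, and the SimpleNamespace objects only stand in for real elements.

```python
from types import SimpleNamespace
from typing import Any, Optional


def get_coordinates(element: Any) -> Optional[Any]:
    """Return location data from element.metadata.coordinates, else a legacy top-level attribute."""
    metadata = getattr(element, "metadata", None)
    coordinates = getattr(metadata, "coordinates", None)
    if coordinates is not None:
        return coordinates
    # Assumed legacy layout: location stored directly on the element.
    return getattr(element, "coordinates", None)


# Stand-ins for elements produced after and before the change.
new_style = SimpleNamespace(metadata=SimpleNamespace(coordinates=((0, 0), (10, 5))))
old_style = SimpleNamespace(coordinates=((0, 0), (10, 5)), metadata=SimpleNamespace())
```

Both stand-ins return the same points from get_coordinates, and an object with no location data returns None.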

unstructured - 0.7.12

Published by tabossert over 1 year ago

0.7.12

Enhancements

Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

Adds Dropbox connector

Fixes

Fix tests that call unstructured-api by passing through an api-key
Fixed page breaks being given (incorrect) page numbers
Fix skipping download on ingest when a source document exists locally

unstructured - 0.7.11

Published by cragwolfe over 1 year ago

0.7.11

Enhancements

More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
Make large model available (from unstructured-inference bump to 0.5.3)
Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
partition_email and partition_msg will now process attachments if process_attachments=True and an attachment partitioning function is passed through with attachment_partitioner=partition.

Features

Fixes

Fix tests that call unstructured-api by passing through an api-key
Fixed page breaks being given (incorrect) page numbers
Fix skipping download on ingest when a source document exists locally

unstructured - 0.7.10

Published by MthwRobinson over 1 year ago

0.7.10

Enhancements

Adds a max_partition parameter to partition_text, partition_pdf, partition_email, partition_msg and partition_xml that sets a limit for the size of an individual document element. Defaults to 1500 for everything except partition_xml, which has a default value of None.
DRY connector refactor

Features

hi_res model for pdfs and images is selectable via environment variable.

Fixes

CSV check now ignores escaped commas.
Fix for filetype exploration util when file content does not have a comma.
Adds negative lookahead to bullet pattern to avoid detecting plain text line breaks like ------- as list items.
Fix pre tag parsing for partition_html
Fix lookup error for annotated Arabic and Hebrew encodings
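
The negative-lookahead fix for the bullet pattern can be illustrated with a simplified regex. The pattern below is an illustration only, not the library's actual bullet pattern, and looks_like_list_item is a hypothetical helper.

```python
import re

# Illustrative pattern: a bullet character at the start of a line,
# unless the line is nothing but a run of dashes (a plain-text rule).
BULLET = re.compile(r"^\s*(?!-{2,}\s*$)[-*•]\s*\S")


def looks_like_list_item(line: str) -> bool:
    """Return True when the line starts with a bullet followed by content."""
    return BULLET.match(line) is not None
```

The lookahead `(?!-{2,}\s*$)` rejects a line made only of dashes before the bullet character class gets a chance to match its first dash.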

unstructured - 0.7.9

Published by cragwolfe over 1 year ago

0.7.9

Enhancements

Improves the string check for leaf nodes in partition_xml.
Adds --partition-ocr-languages to unstructured-ingest.

Features

Adds partition_org for processed Org Mode documents.

Fixes

unstructured - 0.7.8

Published by cragwolfe over 1 year ago

0.7.8

Enhancements

Features

Adds Google Cloud Service connector

Fixes

Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
partition_email now works if there is no message content
Updates the "fast" strategy for partition_pdf so that it's able to recursively
Adds recursive functionality to all fsspec connectors
Adds generic --recursive ingest flag

unstructured - 0.7.7

Published by MthwRobinson over 1 year ago

0.7.7

Enhancements

Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
Adds missed file-like object handling in detect_file_encoding
Adds functionality to extract charset info from eml files

Features

Added coordinate system class to track coordinate types and convert to different coordinate systems
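
One way to picture such a class: each system records its width, its height, and which way the y-axis points, and conversion goes through relative coordinates. CoordinateSystemSketch and its methods are hypothetical names for illustration, not the classes the library ships.

```python
from dataclasses import dataclass
from typing import Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class CoordinateSystemSketch:
    width: float
    height: float
    y_down: bool  # True for screen-style systems whose origin is the top-left corner

    def to_relative(self, point: Point) -> Point:
        """Map a point to the unit square with the origin at the bottom-left."""
        x, y = point
        rel_y = y / self.height
        return (x / self.width, (1.0 - rel_y) if self.y_down else rel_y)

    def from_relative(self, point: Point) -> Point:
        """Map a unit-square point (bottom-left origin) into this system."""
        x, y = point
        rel_y = (1.0 - y) if self.y_down else y
        return (x * self.width, rel_y * self.height)

    def convert_to(self, other: "CoordinateSystemSketch", point: Point) -> Point:
        """Convert a point from this system into another one."""
        return other.from_relative(self.to_relative(point))


# A 200x100 pixel image (y grows downward) and a 100x50 page (y grows upward).
pixels = CoordinateSystemSketch(width=200, height=100, y_down=True)
page = CoordinateSystemSketch(width=100, height=50, y_down=False)
```

Routing every conversion through the unit square means each system only needs to know how to map to and from one shared representation.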

Fixes

Adds an html_assemble_articles kwarg to partition_html to enable users to control whether content outside of <article> tags is captured when <article> tags are present.
Check for the xml attribute on element before looking for pagebreaks in partition_docx.

unstructured - 0.7.6

Published by yuming-long over 1 year ago

0.7.6

Enhancements

Convert fast strategy to ocr_only for images
Adds support for page numbers in .docx and .doc when user- or renderer-created page breaks are present.
Adds retry logic for the unstructured-ingest Biomed connector

Features

Provides users with the ability to extract additional metadata via regex.
Updates partition_docx to include headers and footers in the output.
Create partition_tsv and associated tests. Make additional changes to detect_filetype.
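
Extracting metadata via regex can be sketched as a mapping from field names to patterns, applied to an element's text. extract_regex_metadata and the output shape (matched text plus start and end offsets) are assumptions for illustration, not the library's documented interface.

```python
import re
from typing import Dict, List


def extract_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[dict]]:
    """Collect every match of each named pattern, with its character offsets."""
    results: Dict[str, List[dict]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": m.group(), "start": m.start(), "end": m.end()}
            for m in re.finditer(pattern, text)
        ]
    return results


found = extract_regex_metadata(
    "SPEAKER 1: Hello. SPEAKER 2: Hi.",
    {"speaker": r"SPEAKER \d"},
)
```

Here found["speaker"] holds one entry per speaker label in the sample text.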

Fixes

Remove fake api key in test partition_via_api since we now require valid/empty api keys
Page number defaults to None instead of 1 when page number is not present in the metadata. A page number of None indicates that page numbers are not being tracked for the document or that page numbers do not apply to the element in question.
Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide in case the shape.top and shape.left attributes are None.

unstructured - 0.7.5

Published by cragwolfe over 1 year ago

0.7.5

Enhancements

Adds functionality to sort elements in partition_pdf for fast strategy
Adds ingest tests with --fast strategy on PDF documents
Adds --api-key to unstructured-ingest

Features

Adds partition_rst for processed reStructuredText documents.

Fixes

Adds handling for emails that do not have a datetime to extract.
Adds pdf2image package as core requirement of unstructured (with no extras)

unstructured - 0.7.4

Published by yuming-long over 1 year ago

0.7.4

Enhancements

Allows passing kwargs to request data field for partition_via_api and partition_multiple_via_api
Enable MIME type detection if libmagic is not available
Adds handling for empty files in detect_filetype and partition.

Features

Fixes

Resolve grpcio import issue on weaviate.schema.validate_schema for python 3.9 and 3.10
Remove building detectron2 from source in Dockerfile

unstructured - 0.7.3

Published by yuming-long over 1 year ago

0.7.3

Enhancements

Update IngestDoc abstractions and add data source metadata in ElementMetadata

Features

Fixes

Pass strategy parameter down from partition for partition_image
Filetype detection if a CSV has a text/plain MIME type
convert_office_doc no longer prints file conversion info messages to stdout.
partition_via_api reflects the actual filetype for the file processed in the API.

unstructured - 0.7.2

Published by MthwRobinson over 1 year ago

0.7.2

Enhancements

Adds an optional encoding kwarg to elements_to_json and elements_from_json
Bump version of base image to use new stable version of tesseract

Features

Fixes

Update the read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
Add functionality to the read_txt_file utility function to handle file-like object from URL
Remove the unused parameter encoding from partition_pdf
Change auto.py to have a None default for encoding
Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
Adds benchmark test with test docs in example-docs
Re-enable test_upload_label_studio_data_with_sdk
File detection now detects code files as plain text
Adds tabulate explicitly to dependencies
Fixes an issue in metadata.page_number of pptx files
Adds showing help if no parameters passed

unstructured - 0.7.1

Published by MthwRobinson over 1 year ago

0.7.1

Enhancements

Features

Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with a helper function for defining a class to use in Weaviate schemas.
Builds from the Unstructured base image, built off of Rocky Linux 8.7; this resolves almost all CVEs in the image.

Fixes

unstructured - 0.7.0

Published by MthwRobinson over 1 year ago

0.7.0

Enhancements

Installing detectron2 from source is no longer required when using the local-inference extra.
Updates .pptx parsing to include text in tables.

Features

Fixes

Fixes an issue in _add_element_metadata that caused all elements to have page_number=1 in the element metadata.
Adds .log as a file extension for TXT files.
Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
Allow passed encoding to be used in the replace_mime_encodings
Fixes page metadata for partition_html when include_metadata=False
A ValueError is now raised if file_filename is not specified when you use partition_via_api with a file-like object.

unstructured - 0.6.11

Published by yuming-long over 1 year ago

0.6.11

Enhancements

Supports epub tests since pandoc is updated in base image

Features

Fixes

unstructured - 0.6.10

Published by cragwolfe over 1 year ago

0.6.10

Enhancements

XLS support from auto partition

Features

Fixes

unstructured - 0.6.9

Published by qued over 1 year ago

0.6.9

Enhancements

fast strategy for pdf now keeps element bounding box data
setup.py refactor

Features

Fixes

Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
Adds additional MIME types for CSV
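
The encoding fallback described above amounts to trying a short list of candidate encodings until one decodes cleanly, and only when the caller has not named an encoding. The candidate list and the name decode_with_fallback are assumptions; the library's actual list and order may differ.

```python
from typing import Optional, Sequence, Tuple

# Assumed candidates. latin-1 accepts any byte sequence, so it goes last as a catch-all.
COMMON_ENCODINGS: Sequence[str] = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(data: bytes, encoding: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding used). An explicit encoding is never second-guessed."""
    if encoding is not None:
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    # Only reachable if the catch-all encoding is removed from the list.
    raise ValueError("none of the candidate encodings could decode the data")
```

Order matters in such a list: permissive encodings decode almost anything without error, so stricter ones have to be tried first.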

unstructured - 0.6.8

Published by MthwRobinson over 1 year ago