unstructured | Langchain Ecosystem Directory

unstructured - 0.10.12

Published by cragwolfe about 1 year ago

0.10.12

Enhancements

Removed PIL pin as issue has been resolved upstream
Bump unstructured-inference
- Support for yolox_quantized layout detection model (0.5.20)
YoloX element types added

Features

Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead

Fixes

Bump unstructured-inference
- Avoid divide-by-zero errors with safe_division (0.5.21)
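
The divide-by-zero fix above can be pictured with a small guard. This is a minimal sketch, not the unstructured-inference implementation: only the name safe_division comes from the release note, while the signature, the EPSILON constant, and the clamping behavior are assumptions made for illustration.

```python
# Hypothetical sketch: clamp the denominator so a zero-area box
# cannot raise ZeroDivisionError. EPSILON is an assumed constant.
EPSILON = 1e-7


def safe_division(numerator: float, denominator: float) -> float:
    """Divide, treating denominators smaller than EPSILON as EPSILON."""
    return numerator / max(denominator, EPSILON)


# Example: an overlap ratio for a box whose area is zero.
ratio = safe_division(0.0, 0.0)
```

Clamping keeps the result finite, at the cost of returning a very large value when a nonzero numerator meets a near-zero denominator.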

unstructured - 0.10.11

Published by cragwolfe about 1 year ago

0.10.11

Enhancements

Bump unstructured-inference
- Combine entire-page OCR output with layout-detected elements, to ensure full coverage of the page (0.5.19)

Features

Add S3 writer to the ingest CLI

Fixes

Fix a bug where xy-cut sorting attempts to sort elements without valid coordinates; xy-cut sorting now runs only when all elements have valid coordinates
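
The guard described in this fix can be sketched as follows. The element shape and the function name are hypothetical, and the ordering is a plain top-to-bottom, left-to-right sort standing in for xy-cut, which recursively splits the page into regions; only the "sort only when every element has coordinates" rule comes from the release note.

```python
from typing import List, Optional, Tuple

# Hypothetical element: (text, (x, y) of the top-left corner, or None when missing).
Element = Tuple[str, Optional[Tuple[float, float]]]


def sort_if_all_have_coordinates(elements: List[Element]) -> List[Element]:
    """Sort by position only when every element has coordinates."""
    if not all(coords is not None for _, coords in elements):
        return list(elements)  # leave the original order untouched
    # Stand-in ordering (y first, then x); not a real xy-cut.
    return sorted(elements, key=lambda e: (e[1][1], e[1][0]))


complete = [("b", (0.0, 20.0)), ("a", (0.0, 10.0))]
partial = [("b", (0.0, 20.0)), ("a", None)]
```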

unstructured - 0.10.10

Published by cragwolfe about 1 year ago

0.10.10

Enhancements

Adds text as an input parameter to partition_xml.
partition_xml no longer runs through partition_text, avoiding incorrect splitting
on carriage returns in the XML. Since partition_xml no longer calls partition_text,
min_partition and max_partition are no longer supported in partition_xml.
Bump unstructured-inference==0.5.18, change non-default detectron2 classification threshold
Upgrade base image from rockylinux 8 to rockylinux 9
Serialize IngestDocs to JSON when passing to subprocesses
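
To illustrate the partition_xml change above, the sketch below reads XML from a string and yields one entry per leaf node instead of splitting the text on line breaks. It uses only the standard library and is not how partition_xml is implemented; the leaf_texts helper and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def leaf_texts(xml_text: str) -> List[str]:
    """Return the stripped text of each leaf node, one entry per node."""
    root = ET.fromstring(xml_text)
    texts = []
    for node in root.iter():
        is_leaf = len(node) == 0  # no child elements
        if is_leaf and node.text and node.text.strip():
            texts.append(node.text.strip())
    return texts


# The first item contains a carriage return but remains a single entry.
sample = "<doc><item>first line\rsecond line</item><item>other</item></doc>"
items = leaf_texts(sample)
```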

Features

Fixes

Fix a bug where mismatched elements and bboxes are passed into add_pytesseract_bbox_to_elements

unstructured - 0.10.9

Published by cragwolfe about 1 year ago

0.10.9

Enhancements

Fix test_json to handle only non-extra dependencies file types (plain-text)

Features

Adds chunk_by_title to break a document into sections based on the presence of Title
elements.
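
The idea behind chunk_by_title can be sketched with plain tuples. This is an illustration of sectioning on Title elements only, not the library's implementation: the (category, text) element shape and the group_by_title name are assumptions, and the real function may apply additional rules.

```python
from typing import List, Tuple

# Hypothetical element: (category, text). "Title" marks the start of a section.
Element = Tuple[str, str]


def group_by_title(elements: List[Element]) -> List[List[Element]]:
    """Start a new section each time a Title element appears."""
    sections: List[List[Element]] = []
    for element in elements:
        category, _ = element
        # Open a section on every Title, and for any text before the first Title.
        if category == "Title" or not sections:
            sections.append([])
        sections[-1].append(element)
    return sections


document = [
    ("Title", "Intro"),
    ("NarrativeText", "Some background."),
    ("Title", "Methods"),
    ("NarrativeText", "What was done."),
    ("ListItem", "Step one"),
]
sections = group_by_title(document)
```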

Fixes

Make cv2 dependency optional
Change the metadata.coordinates.points return type of add_pytesseract_bbox_to_elements (ocr_only strategy) to Tuple for consistency.
Re-enable test-ingest-confluence-diff for ingest tests
Fix syntax for ingest test check number of files

unstructured - 0.10.8

Published by cragwolfe about 1 year ago

0.10.8

Enhancements

Release docker image that installs Python 3.10 rather than 3.8

Features

Fixes

unstructured - 0.10.7

Published by cragwolfe about 1 year ago

0.10.7

Enhancements

Features

Fixes

Remove overly aggressive ListItem chunking for images and PDFs, which typically resulted in incoherent elements.

unstructured - 0.10.6

Published by cragwolfe about 1 year ago

0.10.6

Enhancements

Enable partition_email and partition_msg to detect if an email is PGP encrypted. If
an email is PGP encrypted, the functions will return an empty list of elements and
emit a warning about the encrypted content.
Add threaded Slack conversations into Slack connector output
Add functionality to sort elements using xy-cut sorting approach in partition_pdf for hi_res and fast strategies
Bump unstructured-inference
- Set OMP_THREAD_LIMIT to 1 if not set for better tesseract perf (0.5.17)
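
The "set OMP_THREAD_LIMIT to 1 if not set" behavior above can be reproduced with the standard library. This is a minimal sketch of the pattern, assuming it runs before any Tesseract process is started; where unstructured-inference actually sets the variable is not stated in the release note.

```python
import os

# setdefault writes the value only when the variable is absent,
# so a limit chosen by the caller is left alone.
os.environ.setdefault("OMP_THREAD_LIMIT", "1")

thread_limit = os.environ["OMP_THREAD_LIMIT"]
```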

Features

Extract coordinates from PDFs and images when using OCR only strategy and add to metadata

Fixes

Update partition_html to respect the order of <pre> tags.
Fix bug in partition_pdf_or_image where two partitions were called if strategy == "ocr_only".
Bump unstructured-inference
- Fix issue where temporary files were being left behind (0.5.16)
Adds deprecation warning for the file_filename kwarg to partition, partition_via_api,
and partition_multiple_via_api.
Fix documentation build workflow by pinning dependencies

unstructured - 0.10.5

Published by cragwolfe about 1 year ago

0.10.5

Enhancements

partition raises an error and tells the user to install the appropriate extra if a filetype
is detected that is missing dependencies.
Add custom errors to ingest
Bump unstructured-inference==0.5.15
- Handle an uncaught TesseractError (0.5.15)
- Add TIFF test file and TIFF filetype to test_from_image_file in test_layout (0.5.14)
Use entire_page ocr mode for pdfs and images
Add notes on extra installs to docs

Features

Add delta table connector

Fixes

unstructured - 0.10.4

Published by awalker4 about 1 year ago

0.10.4

Enhancements

Adds ability to reuse connections per process in unstructured-ingest
Pass ocr_mode in partition_pdf and set the default back to individual pages for now

Features

Fixes

unstructured - 0.10.2

Published by cragwolfe about 1 year ago

0.10.2

Enhancements

Bump unstructured-inference==0.5.13:
- Fix extracted image elements being included in layout merge, addresses the issue
  where an entire-page image in a PDF was not passed to the layout model when using hi_res.

Features

Fixes

unstructured - 0.10.1

Published by cragwolfe about 1 year ago

0.10.1

Enhancements

Bump unstructured-inference==0.5.12:
- fix to avoid trace for certain PDFs (0.5.12)
- better defaults for DPI for hi_res and Chipper (0.5.11)
- implement full-page OCR (0.5.10)

Features

Fixes

Fix dead links in repository README (Quick Start > Install for local development, and Learn more > Batch Processing)
Update document dependencies to include tesseract-lang for additional language support (required for tests to pass)

unstructured - 0.10.0

Published by cragwolfe about 1 year ago

0.10.0

Enhancements

Update the links and emphasized_texts metadata fields

Features

Fixes

unstructured - 0.9.3

Published by cragwolfe about 1 year ago

0.9.3

Enhancements

Pinned dependency cleanup.
Update partition_csv to always use soupparser_fromstring to parse html text
Update partition_tsv to always use soupparser_fromstring to parse html text
Add metadata.section to capture epub table of contents data
Add unique_element_ids kwarg to partition functions. If True, will use a UUID
for element IDs instead of a SHA-256 hash.
Update partition_xlsx to always use soupparser_fromstring to parse html text
Add functionality to switch html text parser based on whether the html text contains emoji
Add functionality to check if a string contains any emoji characters
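
The difference between the two ID schemes behind unique_element_ids can be sketched as below. The element_id helper is hypothetical and the hash is taken over the text alone as a simplifying assumption; the library may hash different inputs or shorten the digest.

```python
import hashlib
import uuid


def element_id(text: str, unique_element_ids: bool = False) -> str:
    """Return a random UUID when unique_element_ids is True,
    otherwise a SHA-256 hex digest of the text."""
    if unique_element_ids:
        return str(uuid.uuid4())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


hashed_a = element_id("Hello")
hashed_b = element_id("Hello")
random_a = element_id("Hello", unique_element_ids=True)
random_b = element_id("Hello", unique_element_ids=True)
```

Hash-based IDs are deterministic, so identical text yields the same ID; UUIDs stay distinct even for duplicate text.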

Features

Add Airtable Connector to be able to pull views/tables/bases from an Airtable organization

Fixes

Make Notion module discoverable
Fix emails with Content-Disposition: inline and Content-Disposition: attachment with no filename
Fix email attachment filenames which had = in the filename itself

unstructured - 0.9.2

Published by cragwolfe about 1 year ago

0.9.2

Enhancements

Update table extraction section in API documentation to sync with change in Prod API
Update Notion connector to extract to html
Bump unstructured-inference==0.5.9:
- better caching of models
- another version of detectron2 available, though the default layout model is unchanged
Added UUID option for element_id

Features

Adds Sharepoint connector.

Fixes

Bump unstructured-inference==0.5.9:
- ignores Tesseract errors where no text is extracted for tiles that indeed have no text

unstructured - 0.9.1

Published by ryannikolaidis about 1 year ago

0.9.1

Enhancements

Adds --partition-pdf-infer-table-structure to unstructured-ingest.
Enable partition_html to skip headers and footers with the skip_headers_and_footers flag.
Update partition_doc and partition_docx to track emphasized texts in the output
Adds post processing function filter_element_types
Set the default strategy for partitioning images to hi_res
Add page break parameter section in API documentation to sync with change in Prod API
Update partition_html to track emphasized texts in the output
Update XMLDocument._read_xml to create <p> tag element for the text enclosed in the <pre> tag
Add parameter include_tail_text to _construct_text to enable (skip) tail text inclusion
Add Notion connector

Features

Fixes

Remove unused _partition_via_api function
Fixed emoji bug in partition_xlsx.
Pass file_filename metadata when partitioning file object
Skip ingest test on missing Slack token
Add Dropbox variables to CI environments
Remove default encoding for ingest
Adds new element type EmailAddress for recognizing email address in the text
Simplifies min_partition logic; makes partitions falling below the min_partition
less likely.
Fix bug where ingest test check for number of files fails in smoke test
Fix unstructured-ingest entrypoint failure

unstructured - 0.9.0

Published by MthwRobinson about 1 year ago

0.9.0

Enhancements

Dependencies are now split by document type, creating a slimmer base installation.

unstructured - 0.8.8

Published by cragwolfe about 1 year ago

0.8.8

Enhancements

Features

Fixes

Rename "date" field to "last_modified"
Adds Box connector

unstructured - 0.8.7

Published by yuming-long about 1 year ago

0.8.7

Enhancements

Put back useful function split_by_paragraph

Features

Fixes

Fix argument order in NLTK download step

unstructured - 0.8.6

Published by cragwolfe about 1 year ago

0.8.6

Enhancements

Features

Fixes

Remove debug print lines and non-functional code

unstructured - 0.8.5

Published by yuming-long about 1 year ago

0.8.5

Enhancements

Add parameter skip_infer_table_types to enable (skip) table extraction for other doc types
Adds optional Unstructured API unit tests in CI
Tracks last modified date for all document types.