unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

APACHE-2.0 License

Downloads: 1.6M
Stars: 5.8K
Committers: 110

unstructured - 0.10.12

Published by cragwolfe about 1 year ago

0.10.12

Enhancements

  • Removed PIL pin as issue has been resolved upstream
  • Bump unstructured-inference
    • Support for yolox_quantized layout detection model (0.5.20)
  • YoloX element types added

Features

  • Add Salesforce Connector to be able to pull Account, Case, Campaign, EmailMessage, Lead

Fixes

  • Bump unstructured-inference
    • Avoid divide-by-zero errors with safe_division (0.5.21)
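
The divide-by-zero fix can be pictured with a small guard function. The sketch below only illustrates the general idea; the signature and the `default` fallback are assumptions, not the code shipped in unstructured-inference.

```python
def safe_division(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Hypothetical helper: return `default` instead of raising on a zero denominator."""
    if denominator == 0:
        return default
    return numerator / denominator


# Example: an overlap ratio where the reference box may have zero area.
print(safe_division(6.0, 3.0))
print(safe_division(6.0, 0.0))
```

Returning a fallback keeps downstream layout code running when a degenerate (zero-area) region shows up, at the cost of hiding the anomaly from the caller.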

unstructured - 0.10.11

Published by cragwolfe about 1 year ago

0.10.11

Enhancements

  • Bump unstructured-inference
    • Combine entire-page OCR output with layout-detected elements, to ensure full coverage of the page (0.5.19)

Features

  • Add an S3 writer to the ingest CLI

Fixes

  • Fix a bug where xy-cut sorting attempts to sort elements without valid coordinates; xy-cut sorting now runs only when all elements have valid coordinates
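
The guard described in this fix can be sketched as follows. This is an illustration under stated assumptions: elements are plain dicts with a hypothetical `points` key, and a simple top-to-bottom, left-to-right sort stands in for the real xy-cut algorithm.

```python
from typing import Dict, List


def sort_if_all_have_coordinates(elements: List[Dict]) -> List[Dict]:
    """Sort spatially only when every element has coordinates; otherwise keep input order."""
    if not all(el.get("points") for el in elements):
        return list(elements)

    def top_left(el: Dict):
        # Stand-in ordering key (not xy-cut): topmost first, then leftmost.
        xs = [x for x, _ in el["points"]]
        ys = [y for _, y in el["points"]]
        return (min(ys), min(xs))

    return sorted(elements, key=top_left)


boxes = [
    {"text": "footer", "points": [(0, 90), (100, 90), (100, 100), (0, 100)]},
    {"text": "header", "points": [(0, 0), (100, 0), (100, 10), (0, 10)]},
]
print([el["text"] for el in sort_if_all_have_coordinates(boxes)])
```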

unstructured - 0.10.10

Published by cragwolfe about 1 year ago

0.10.10

Enhancements

  • Adds text as an input parameter to partition_xml.
  • partition_xml no longer runs through partition_text, avoiding incorrect splitting
    on carriage returns in the XML. Since partition_xml no longer calls partition_text,
    min_partition and max_partition are no longer supported in partition_xml.
  • Bump unstructured-inference==0.5.18, change non-default detectron2 classification threshold
  • Upgrade base image from rockylinux 8 to rockylinux 9
  • Serialize IngestDocs to JSON when passing to subprocesses
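
To see why routing XML text through a line-based splitter was a problem, the sketch below compares one text fragment per leaf node with a naive split on line breaks. It uses only the standard library and hypothetical helper names; it is not how partition_xml is implemented.

```python
import xml.etree.ElementTree as ET
from typing import List

SAMPLE = "<notes><note>first line\rsecond line</note><note>another note</note></notes>"


def leaf_texts(xml_text: str) -> List[str]:
    """Return one text fragment per leaf node, keeping internal line breaks intact."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iter()
        if len(el) == 0 and el.text and el.text.strip()
    ]


def naive_line_split(xml_text: str) -> List[str]:
    """Split each leaf's text on line breaks, as a text-oriented partitioner would."""
    return [
        line.strip()
        for text in leaf_texts(xml_text)
        for line in text.splitlines()
        if line.strip()
    ]


print(len(leaf_texts(SAMPLE)), len(naive_line_split(SAMPLE)))
```

The first note is one logical unit, but the line-based split breaks it into two fragments.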

Features

Fixes

  • Fix a bug where mismatched elements and bboxes are passed into add_pytesseract_bbox_to_elements

unstructured - 0.10.9

Published by cragwolfe about 1 year ago

0.10.9

Enhancements

  • Fix test_json to handle only non-extra dependencies file types (plain-text)

Features

  • Adds chunk_by_title to break a document into sections based on the presence of Title
    elements.
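
The idea behind title-based chunking can be sketched in a few lines. The `Element` class and `chunk_by_title_sketch` function below are hypothetical stand-ins; the library's chunk_by_title may expose more options and different types.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    category: str  # e.g. "Title" or "NarrativeText" (illustrative labels)
    text: str


def chunk_by_title_sketch(elements: List[Element]) -> List[List[Element]]:
    """Start a new section whenever a Title element is encountered."""
    sections: List[List[Element]] = []
    for el in elements:
        # Open a section at each Title, and also for any text preceding the first Title.
        if el.category == "Title" or not sections:
            sections.append([])
        sections[-1].append(el)
    return sections


doc = [
    Element("Title", "Introduction"),
    Element("NarrativeText", "Some opening text."),
    Element("NarrativeText", "More opening text."),
    Element("Title", "Methods"),
    Element("NarrativeText", "How it was done."),
]
for section in chunk_by_title_sketch(doc):
    print([el.text for el in section])
```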

Fixes

  • Make cv2 dependency optional
  • Edit add_pytesseract_bbox_to_elements's (ocr_only strategy) metadata.coordinates.points return type to Tuple for consistency.
  • Re-enable test-ingest-confluence-diff for ingest tests
  • Fix syntax for ingest test check number of files

unstructured - 0.10.8

Published by cragwolfe about 1 year ago

0.10.8

Enhancements

  • Release docker image that installs Python 3.10 rather than 3.8

Features

Fixes

unstructured - 0.10.7

Published by cragwolfe about 1 year ago

0.10.7

Enhancements

Features

Fixes

  • Remove overly aggressive ListItem chunking for images and PDFs, which typically resulted in incoherent elements.

unstructured - 0.10.6

Published by cragwolfe about 1 year ago

0.10.6

Enhancements

  • Enable partition_email and partition_msg to detect if an email is PGP encrypted. If
    an email is PGP encrypted, the functions will return an empty list of elements and
    emit a warning about the encrypted content.
  • Add threaded Slack conversations into Slack connector output
  • Add functionality to sort elements using xy-cut sorting approach in partition_pdf for hi_res and fast strategies
  • Bump unstructured-inference
    • Set OMP_THREAD_LIMIT to 1 if not set for better tesseract perf (0.5.17)
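
The PGP behavior described above (empty result plus a warning) might look roughly like this. The sketch assumes PGP/MIME messages, which use the `multipart/encrypted` content type; the function name is hypothetical and the detection logic in partition_email and partition_msg may differ.

```python
import email
import warnings
from typing import List

ENCRYPTED = (
    'Content-Type: multipart/encrypted; protocol="application/pgp-encrypted"; boundary="b"\n'
    "\n"
    "--b\n"
    "Content-Type: application/pgp-encrypted\n"
    "\n"
    "Version: 1\n"
    "--b\n"
    "Content-Type: application/octet-stream\n"
    "\n"
    "-----BEGIN PGP MESSAGE-----\n"
    "-----END PGP MESSAGE-----\n"
    "--b--\n"
)
PLAIN = "Subject: hello\n\nFirst line\nSecond line\n"


def partition_email_sketch(raw_message: str) -> List[str]:
    """Return non-empty text lines, or an empty list (with a warning) if PGP/MIME encrypted."""
    msg = email.message_from_string(raw_message)
    if msg.get_content_type() == "multipart/encrypted":
        warnings.warn("Encrypted email detected; no elements returned.", stacklevel=2)
        return []
    parts = [p.get_payload() for p in msg.walk() if p.get_content_type() == "text/plain"]
    return [line for part in parts for line in part.splitlines() if line.strip()]


print(partition_email_sketch(PLAIN))
```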

Features

  • Extract coordinates from PDFs and images when using OCR only strategy and add to metadata

Fixes

  • Update partition_html to respect the order of <pre> tags.
  • Fix bug in partition_pdf_or_image where two partitions were called if strategy == "ocr_only".
  • Bump unstructured-inference
    • Fix issue where temporary files were being left behind (0.5.16)
  • Adds deprecation warning for the file_filename kwarg to partition, partition_via_api,
    and partition_multiple_via_api.
  • Fix documentation build workflow by pinning dependencies
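
A deprecation warning for a renamed keyword argument is commonly wired up as shown below. This is a generic sketch: `partition_sketch` is hypothetical, and `metadata_filename` is assumed here as the replacement name rather than confirmed by these notes.

```python
import warnings
from typing import Dict, Optional


def partition_sketch(
    filename: Optional[str] = None,
    metadata_filename: Optional[str] = None,  # assumed replacement kwarg
    file_filename: Optional[str] = None,      # deprecated kwarg
) -> Dict[str, Optional[str]]:
    """Accept the old kwarg for now, but warn and map it onto the new one."""
    if file_filename is not None:
        warnings.warn(
            "The file_filename kwarg is deprecated; use metadata_filename instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if metadata_filename is None:
            metadata_filename = file_filename
    return {"filename": metadata_filename or filename}


print(partition_sketch(filename="a.txt", metadata_filename="report.txt"))
```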

unstructured - 0.10.5

Published by cragwolfe about 1 year ago

0.10.5

Enhancements

  • partition raises an error and tells the user to install the appropriate extra if a filetype
    is detected that is missing dependencies.
  • Add custom errors to ingest
  • Bump unstructured-inference==0.5.15
    • Handle an uncaught TesseractError (0.5.15)
    • Add TIFF test file and TIFF filetype to test_from_image_file in test_layout (0.5.14)
  • Use entire_page ocr mode for pdfs and images
  • Add notes on extra installs to docs
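
The "tell the user which extra to install" behavior can be approximated with a dependency check like the one below. The helper name and message wording are assumptions for illustration; only the general pattern is taken from the note above.

```python
import importlib.util


def require_dependency(module_name: str, extra: str) -> None:
    """Raise a descriptive ImportError if `module_name` cannot be found."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name} is not installed. "
            f'Install it with: pip install "unstructured[{extra}]"'
        )


# A standard-library module is always present, so this call passes silently.
require_dependency("json", "docx")
```

Checking with `find_spec` avoids importing a heavy dependency just to learn whether it is available.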

Features

  • Add delta table connector

Fixes

unstructured - 0.10.4

Published by awalker4 about 1 year ago

0.10.4

Enhancements

  • Adds ability to reuse connections per process in unstructured-ingest
  • Pass ocr_mode in partition_pdf and set the default back to individual pages for now

Features

Fixes

unstructured - 0.10.2

Published by cragwolfe about 1 year ago

0.10.2

Enhancements

  • Bump unstructured-inference==0.5.13:
    • Fix extracted image elements being included in layout merge, addresses the issue
      where an entire-page image in a PDF was not passed to the layout model when using hi_res.

Features

Fixes

unstructured - 0.10.1

Published by cragwolfe about 1 year ago

0.10.1

Enhancements

  • Bump unstructured-inference==0.5.12:
    • fix to avoid trace for certain PDFs (0.5.12)
    • better defaults for DPI for hi_res and Chipper (0.5.11)
    • implement full-page OCR (0.5.10)

Features

Fixes

  • Fix dead links in repository README (Quick Start > Install for local development, and Learn more > Batch Processing)
  • Update document dependencies to include tesseract-lang for additional language support (required for tests to pass)

unstructured - 0.10.0

Published by cragwolfe about 1 year ago

0.10.0

Enhancements

  • Update the links and emphasized_texts metadata fields

Features

Fixes

unstructured - 0.9.3

Published by cragwolfe about 1 year ago

0.9.3

Enhancements

  • Pinned dependency cleanup.
  • Update partition_csv to always use soupparser_fromstring to parse html text
  • Update partition_tsv to always use soupparser_fromstring to parse html text
  • Add metadata.section to capture epub table of contents data
  • Add unique_element_ids kwarg to partition functions. If True, will use a UUID
    for element IDs instead of a SHA-256 hash.
  • Update partition_xlsx to always use soupparser_fromstring to parse html text
  • Add functionality to switch html text parser based on whether the html text contains emoji
  • Add functionality to check if a string contains any emoji characters
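
The difference between hash-based and UUID-based element IDs can be shown with the standard library. The `element_id` helper is hypothetical, and details such as hash truncation in the library are not covered here.

```python
import hashlib
import uuid


def element_id(text: str, unique_element_ids: bool = False) -> str:
    """Return a random UUID if requested, else a deterministic SHA-256 hex digest of the text."""
    if unique_element_ids:
        return str(uuid.uuid4())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Identical text hashes to the same ID; UUIDs differ on every call.
print(element_id("Hello") == element_id("Hello"))
print(element_id("Hello", unique_element_ids=True) == element_id("Hello", unique_element_ids=True))
```

Hash IDs are reproducible but collide for repeated text (for example, identical headers on every page); UUIDs avoid collisions at the cost of reproducibility.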

Features

  • Add Airtable Connector to be able to pull views/tables/bases from an Airtable organization

Fixes

  • Make notion module discoverable
  • Fix emails with Content-Disposition: inline and Content-Disposition: attachment with no filename
  • Fix email attachment filenames which had = in the filename itself

unstructured - 0.9.2

Published by cragwolfe about 1 year ago

0.9.2

Enhancements

  • Update table extraction section in API documentation to sync with change in Prod API
  • Update Notion connector to extract to html
  • Bump unstructured-inference==0.5.9:
    • better caching of models
    • another version of detectron2 available, though the default layout model is unchanged
  • Added UUID option for element_id

Features

  • Adds Sharepoint connector.

Fixes

  • Bump unstructured-inference==0.5.9:
    • ignores Tesseract errors where no text is extracted for tiles that indeed have no text

unstructured - 0.9.1

Published by ryannikolaidis about 1 year ago

0.9.1

Enhancements

  • Adds --partition-pdf-infer-table-structure to unstructured-ingest.
  • Enable partition_html to skip headers and footers with the skip_headers_and_footers flag.
  • Update partition_doc and partition_docx to track emphasized texts in the output
  • Adds post processing function filter_element_types
  • Set the default strategy for partitioning images to hi_res
  • Add page break parameter section in API documentation to sync with change in Prod API
  • Update partition_html to track emphasized texts in the output
  • Update XMLDocument._read_xml to create <p> tag element for the text enclosed in the <pre> tag
  • Add parameter include_tail_text to _construct_text to enable or skip tail text inclusion
  • Add Notion connector
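
A post-processing filter by element type, in the spirit of filter_element_types, can be sketched as follows. The function name, parameter names, and dict-based elements are assumptions for illustration and may not match the library's signature.

```python
from typing import Dict, Iterable, List, Optional


def filter_element_types_sketch(
    elements: List[Dict[str, str]],
    include: Optional[Iterable[str]] = None,
    exclude: Optional[Iterable[str]] = None,
) -> List[Dict[str, str]]:
    """Keep only `include` types, or drop `exclude` types; the two are mutually exclusive."""
    if include is not None and exclude is not None:
        raise ValueError("Pass either include or exclude, not both.")
    if include is not None:
        allowed = set(include)
        return [el for el in elements if el["type"] in allowed]
    if exclude is not None:
        blocked = set(exclude)
        return [el for el in elements if el["type"] not in blocked]
    return list(elements)


parsed = [
    {"type": "Title", "text": "Report"},
    {"type": "Footer", "text": "Page 1"},
    {"type": "NarrativeText", "text": "Body text."},
]
print(filter_element_types_sketch(parsed, exclude=["Footer"]))
```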

Features

Fixes

  • Remove unused _partition_via_api function
  • Fixed emoji bug in partition_xlsx.
  • Pass file_filename metadata when partitioning file object
  • Skip ingest test on missing Slack token
  • Add Dropbox variables to CI environments
  • Remove default encoding for ingest
  • Adds new element type EmailAddress for recognizing email addresses in the text
  • Simplifies min_partition logic; makes partitions falling below the min_partition
    less likely.
  • Fix bug where ingest test check for number of files fails in smoke test
  • Fix unstructured-ingest entrypoint failure

unstructured - 0.9.0

Published by MthwRobinson about 1 year ago

0.9.0

Enhancements

  • Dependencies are now split by document type, creating a slimmer base installation.

unstructured - 0.8.8

Published by cragwolfe about 1 year ago

0.8.8

Enhancements

Features

Fixes

  • Rename "date" field to "last_modified"
  • Adds Box connector

unstructured - 0.8.7

Published by yuming-long about 1 year ago

0.8.7

Enhancements

  • Put back useful function split_by_paragraph

Features

Fixes

  • Fix argument order in NLTK download step

unstructured - 0.8.6

Published by cragwolfe about 1 year ago

0.8.6

Enhancements

Features

Fixes

  • Remove debug print lines and non-functional code

unstructured - 0.8.5

Published by yuming-long about 1 year ago

0.8.5

Enhancements

  • Add parameter skip_infer_table_types to enable or skip table extraction for other doc types
  • Adds optional Unstructured API unit tests in CI
  • Tracks last modified date for all document types.

Features

Fixes

  • NLTK now only gets downloaded if necessary.
  • Handling for empty tables in Word Documents and PowerPoints.

Package Rankings
Top 1.48% on Pypi.org
Top 3.72% on Proxy.golang.org