unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

APACHE-2.0 License

Downloads: 1.6M
Stars: 5.8K
Committers: 110


unstructured - 0.8.4

Published by Klaijan about 1 year ago

0.8.4

Enhancements

  • Additional tests and refactor of JSON detection.
  • Update functionality to retrieve image metadata from a page for document_to_element_list
  • Links are now tracked in partition_html output.
  • Set the file's current position to the beginning after reading the file in convert_to_bytes
  • Add min_partition kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max_partition so words are not split.
  • Add slide notes to pptx
  • Add --encoding directive to ingest
  • Improve json detection by detect_filetype
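
The min_partition and word-boundary behavior described above can be pictured with a small standalone sketch. This is not the library's implementation: the helper names split_on_words and combine_short are hypothetical, and the choice to keep an over-long single word whole is an assumption of the sketch.

```python
from typing import List


def split_on_words(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters without cutting words.

    Assumption of this sketch: a single word longer than max_partition is kept whole.
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_short(parts: List[str], min_partition: int) -> List[str]:
    """Merge parts shorter than min_partition into the text that follows them."""
    combined: List[str] = []
    buffer = ""
    for part in parts:
        buffer = f"{buffer} {part}" if buffer else part
        if len(buffer) >= min_partition:
            combined.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing remainder is attached to the previous part, if there is one.
        if combined:
            combined[-1] = f"{combined[-1]} {buffer}"
        else:
            combined.append(buffer)
    return combined
```

For example, split_on_words("the quick brown fox jumps", 10) yields three chunks, none of which ends in the middle of a word.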

Features

  • Adds Outlook connector
  • Add support for dpi parameter in inference library
  • Adds OneDrive connector.
  • Add Confluence connector for the ingest CLI to pull the body text from all documents from all spaces in a Confluence domain.

Fixes

  • Fixes issue with email partitioning where From field was being assigned the To field value.
  • Use the image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
  • Add functionality to write images to computer storage temporarily instead of keeping them in memory for ocr_only strategy
  • Add functionality to convert a PDF in small chunks of pages at a time for ocr_only strategy
  • Adds .txt, .text, and .tab to the list of extensions to check if a file
    has a text/plain MIME type.
  • Enables filters to be passed to partition_doc so it doesn't error with LibreOffice 7.
  • Removed old error message that's superseded by requires_dependencies.
  • Removes using hi_res as the default strategy value for partition_via_api and partition_multiple_via_api
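
The From/To fix above concerns keeping the two headers separate when reading an email. A minimal sketch using only the standard library email package shows the intended outcome; the function name and the "sender"/"recipient" keys are hypothetical and are not the library's metadata fields.

```python
from email import message_from_string

RAW_EMAIL = """\
From: Alice <alice@example.com>
To: Bob <bob@example.com>
Subject: Status update

Hello Bob.
"""


def extract_sender_and_recipient(raw: str) -> dict:
    """Read the From and To headers independently so neither overwrites the other."""
    msg = message_from_string(raw)
    return {
        "sender": msg.get("From"),    # value of the From header
        "recipient": msg.get("To"),   # value of the To header
    }
```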
unstructured - 0.8.1

Published by rbiseck3 over 1 year ago

0.8.1

Enhancements

  • Add support for Python 3.11

Features

Fixes

  • Fixed an issue where the auto strategy detected a scanned document as having extractable text and used the fast strategy, resulting in no output.
  • Fix list detection in MS Word documents.
  • Don't instantiate an element with a coordinate system when there isn't a way to get its location data.
unstructured - 0.8.0

Published by rbiseck3 over 1 year ago

Enhancements

  • Allow model used for hi res pdf partition strategy to be chosen when called.
  • Updated inference package

Features

  • Add metadata_filename parameter across all partition functions

Fixes

  • Adjust encoding recognition threshold value in detect_file_encoding

  • Fix KeyError when isd_to_elements doesn't find a type

  • Fix _output_filename for local connector, allowing single files to be written correctly to the disk

  • Fix for cases where an invalid encoding is extracted from an email header.

BREAKING CHANGES

  • Information about an element's location is no longer returned as top-level attributes of an element. Instead, it is returned in the coordinates attribute of the element's metadata.
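
Code that reads element locations may need a small adjustment after this change. The sketch below uses stand-in objects rather than the library's own element classes, and get_coordinates is a hypothetical helper: it looks under metadata.coordinates first and falls back to a top-level coordinates attribute for objects produced before 0.8.0.

```python
from types import SimpleNamespace


def get_coordinates(element):
    """Return location data from element.metadata.coordinates, with a legacy fallback."""
    metadata = getattr(element, "metadata", None)
    coords = getattr(metadata, "coordinates", None)
    if coords is not None:
        return coords
    # Legacy layout (assumed): coordinates stored directly on the element.
    return getattr(element, "coordinates", None)


# Stand-ins for illustration only; real elements come from the partition functions.
new_style = SimpleNamespace(metadata=SimpleNamespace(coordinates=((0, 0), (10, 0))))
old_style = SimpleNamespace(coordinates=((1, 1), (2, 2)))
```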
unstructured - 0.7.12

Published by tabossert over 1 year ago

0.7.12

Enhancements

  • Adds include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf

Features

  • Adds Dropbox connector

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally
unstructured - 0.7.11

Published by cragwolfe over 1 year ago

0.7.11

Enhancements

  • More deterministic element ordering when using hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
  • Make large model available (from unstructured-inference bump to 0.5.3)
  • Combine inferred elements with extracted elements (from unstructured-inference bump to 0.5.2)
  • partition_email and partition_msg will now process attachments if process_attachments=True
    and an attachment partitioning function is passed through with attachment_partitioner=partition.
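
The attachment option above pairs a flag with a caller-supplied partitioning function. The following is a sketch of that pattern built on the standard library email package, not the library's code: partition_attachments is a hypothetical name, and raising ValueError when no partitioner is supplied is an assumption of the sketch.

```python
from email import message_from_bytes, policy


def partition_attachments(raw: bytes, process_attachments=False, attachment_partitioner=None):
    """Apply attachment_partitioner to each attachment payload when processing is enabled."""
    if not process_attachments:
        return []
    if attachment_partitioner is None:
        # Assumption of this sketch: enabling processing without a partitioner is an error.
        raise ValueError("attachment_partitioner is required when process_attachments=True")
    msg = message_from_bytes(raw, policy=policy.default)
    results = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True)  # attachment content as bytes
        results.append((part.get_filename(), attachment_partitioner(payload)))
    return results
```

Any callable that accepts bytes can serve as the partitioner here, which mirrors the idea of passing a partition function through a keyword argument.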

Features

Fixes

  • Fix tests that call unstructured-api by passing through an api-key
  • Fixed page breaks being given (incorrect) page numbers
  • Fix skipping download on ingest when a source document exists locally
unstructured - 0.7.10

Published by MthwRobinson over 1 year ago

0.7.10

Enhancements

  • Adds a max_partition parameter to partition_text, partition_pdf, partition_email,
    partition_msg and partition_xml that sets a limit for the size of an individual
    document element. Defaults to 1500 for everything except partition_xml, which has
    a default value of None.
  • DRY connector refactor

Features

  • hi_res model for pdfs and images is selectable via environment variable.

Fixes

  • CSV check now ignores escaped commas.
  • Fix for filetype exploration util when file content does not have a comma.
  • Adds negative lookahead to bullet pattern to avoid detecting plain text line
    breaks like ------- as list items.
  • Fix pre tag parsing for partition_html
  • Fix lookup error for annotated Arabic and Hebrew encodings
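
The negative-lookahead fix for bullets can be illustrated with a pair of regular expressions. The patterns below are simplified assumptions made for this sketch and are not the library's actual bullet pattern; they only show why a lookahead keeps a rule line such as ------- from being read as a list item.

```python
import re

# Naive pattern: a bullet character followed by any non-space text.
NAIVE_BULLET_RE = re.compile(r"^\s*[-*•]\s*\S")

# Same pattern with a negative lookahead: the bullet character must not be
# immediately followed by another bullet character.
BULLET_RE = re.compile(r"^\s*[-*•](?![-*•])\s*\S")


def is_bullet(line: str) -> bool:
    """Return True when the line looks like a list item under the stricter pattern."""
    return BULLET_RE.match(line) is not None
```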
unstructured - 0.7.9

Published by cragwolfe over 1 year ago

0.7.9

Enhancements

  • Improvements to string check for leafs in partition_xml.
  • Adds --partition-ocr-languages to unstructured-ingest.

Features

  • Adds partition_org for processed Org Mode documents.

Fixes

unstructured - 0.7.8

Published by cragwolfe over 1 year ago

0.7.8

Enhancements

Features

  • Adds Google Cloud Service connector

Fixes

  • Updates the parse_email for partition_eml so that unstructured-api passes the smoke tests
  • partition_email now works if there is no message content
  • Updates the "fast" strategy for partition_pdf so that it's able to recursively
  • Adds recursive functionality to all fsspec connectors
  • Adds generic --recursive ingest flag
unstructured - 0.7.7

Published by MthwRobinson over 1 year ago

0.7.7

Enhancements

  • Adds functionality to replace the MIME encodings for eml files with one of the common encodings if a unicode error occurs
  • Adds missed file-like object handling in detect_file_encoding
  • Adds functionality to extract charset info from eml files
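
Charset information in an .eml file lives in the Content-Type header of each MIME part. As a rough illustration using only the standard library (the helper name find_charsets is hypothetical, and this is not the library's implementation), the declared charsets can be collected like this:

```python
from email import message_from_bytes
from typing import List


def find_charsets(raw: bytes) -> List[str]:
    """Collect the distinct charsets declared by the parts of an email message."""
    msg = message_from_bytes(raw)
    charsets: List[str] = []
    for part in msg.walk():
        charset = part.get_content_charset()  # lowercased charset, or None if not declared
        if charset and charset not in charsets:
            charsets.append(charset)
    return charsets
```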

Features

  • Added coordinate system class to track coordinate types and convert to different coordinate

Fixes

  • Adds an html_assemble_articles kwarg to partition_html to enable users to
    control whether content outside of <article> tags is captured when
    <article> tags are present.
  • Check for the xml attribute on element before looking for pagebreaks in partition_docx.
unstructured - 0.7.6

Published by yuming-long over 1 year ago

0.7.6

Enhancements

  • Convert fast strategy to ocr_only for images
  • Adds support for page numbers in .docx and .doc when user or renderer
    created page breaks are present.
  • Adds retry logic for the unstructured-ingest Biomed connector

Features

  • Provides users with the ability to extract additional metadata via regex.
  • Updates partition_docx to include headers and footers in the output.
  • Create partition_tsv and associated tests. Make additional changes to detect_filetype.
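
Regex-based metadata extraction, as mentioned in the features above, amounts to running named patterns over an element's text and recording the matches. The sketch below is a standalone illustration; the function name, the input mapping, and the shape of the returned dictionaries are assumptions and may differ from what the library returns.

```python
import re
from typing import Dict, List


def extract_regex_metadata(text: str, patterns: Dict[str, str]) -> Dict[str, List[dict]]:
    """Map each pattern name to the matches found in text, with character offsets."""
    results: Dict[str, List[dict]] = {}
    for name, pattern in patterns.items():
        results[name] = [
            {"text": match.group(), "start": match.start(), "end": match.end()}
            for match in re.finditer(pattern, text)
        ]
    return results
```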

Fixes

  • Remove fake api key in test partition_via_api since we now require valid/empty api keys
  • Page number defaults to None instead of 1 when page number is not present in the metadata.
    A page number of None indicates that page numbers are not being tracked for the document
    or that page numbers do not apply to the element in question.
  • Fixes an issue with some pptx files. Assume pptx shapes are found in top left position of slide
    in case the shape.top and shape.left attributes are None.
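
The pptx fix above treats a missing position as the top-left corner. One place this matters is when shapes are ordered by position, since None cannot be compared with integers. The sketch below uses stand-in objects instead of real pptx shapes, and shape_sort_key is a hypothetical helper.

```python
from types import SimpleNamespace


def shape_sort_key(shape):
    """Order shapes top-to-bottom, then left-to-right, treating None as position 0."""
    top = shape.top if shape.top is not None else 0
    left = shape.left if shape.left is not None else 0
    return (top, left)


# Stand-in shapes for illustration only.
shapes = [
    SimpleNamespace(name="body", top=200, left=50),
    SimpleNamespace(name="unplaced", top=None, left=None),
    SimpleNamespace(name="title", top=20, left=50),
]
ordered = sorted(shapes, key=shape_sort_key)
```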
unstructured - 0.7.5

Published by cragwolfe over 1 year ago

0.7.5

Enhancements

  • Adds functionality to sort elements in partition_pdf for fast strategy
  • Adds ingest tests with --fast strategy on PDF documents
  • Adds --api-key to unstructured-ingest

Features

  • Adds partition_rst for processed ReStructured Text documents.

Fixes

  • Adds handling for emails that do not have a datetime to extract.
  • Adds pdf2image package as core requirement of unstructured (with no extras)
unstructured - 0.7.4

Published by yuming-long over 1 year ago

0.7.4

Enhancements

  • Allows passing kwargs to request data field for partition_via_api and partition_multiple_via_api
  • Enable MIME type detection if libmagic is not available
  • Adds handling for empty files in detect_filetype and partition.
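
Detecting a MIME type without libmagic generally means falling back to something weaker, such as the file extension. The sketch below shows one possible fallback using the standard library mimetypes module; guess_mime_type is a hypothetical helper and the fallback order is an assumption, not a description of what detect_filetype does internally.

```python
import mimetypes


def guess_mime_type(filename: str, default: str = "application/octet-stream") -> str:
    """Prefer libmagic (via python-magic) when it is usable, else guess from the extension."""
    try:
        import magic  # python-magic; needs the libmagic system library
    except (ImportError, OSError):
        magic = None
    if magic is not None:
        try:
            return magic.from_file(filename, mime=True)
        except Exception:
            # Unreadable or missing file, or an incompatible magic module: fall through.
            pass
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or default
```

Extension-based guessing never opens the file, so it is cheaper but can be fooled by a wrong or missing extension.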

Features

Fixes

  • Resolve grpcio import issue on weaviate.schema.validate_schema for python 3.9 and 3.10
  • Remove building detectron2 from source in Dockerfile
unstructured - 0.7.3

Published by yuming-long over 1 year ago

0.7.3

Enhancements

  • Update IngestDoc abstractions and add data source metadata in ElementMetadata

Features

Fixes

  • Pass strategy parameter down from partition for partition_image
  • Filetype detection if a CSV has a text/plain MIME type
  • convert_office_doc no longer prints file conversion info messages to stdout.
  • partition_via_api reflects the actual filetype for the file processed in the API.
unstructured - 0.7.2

Published by MthwRobinson over 1 year ago

0.7.2

Enhancements

  • Adds an optional encoding kwarg to elements_to_json and elements_from_json
  • Bump version of base image to use new stable version of tesseract

Features

Fixes

  • Update the read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
  • Add functionality to the read_txt_file utility function to handle file-like object from URL
  • Remove the unused parameter encoding from partition_pdf
  • Change auto.py to have a None default for encoding
  • Add functionality to try other common encodings for html and xml files if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds benchmark test with test docs in example-docs
  • Re-enable test_upload_label_studio_data_with_sdk
  • File detection now detects code files as plain text
  • Adds tabulate explicitly to dependencies
  • Fixes an issue in metadata.page_number of pptx files
  • Adds showing help if no parameters passed
unstructured - 0.7.1

Published by MthwRobinson over 1 year ago

0.7.1

Enhancements

Features

  • Add stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with
    a helper function for defining a class to use in Weaviate schemas.
  • Builds from the Unstructured base image, which is built on Rocky Linux 8.7; this resolves almost all CVEs in the image.

Fixes

unstructured - 0.7.0

Published by MthwRobinson over 1 year ago

0.7.0

Enhancements

  • Installing detectron2 from source is no longer required when using the local-inference extra.
  • Updates .pptx parsing to include text in tables.

Features

Fixes

  • Fixes an issue in _add_element_metadata that caused all elements to have page_number=1
    in the element metadata.
  • Adds .log as a file extension for TXT files.
  • Adds functionality to try other common encodings for email (.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
  • Allow passed encoding to be used in the replace_mime_encodings
  • Fixes page metadata for partition_html when include_metadata=False
  • A ValueError is now raised if file_filename is not specified when you use partition_via_api
    with a file-like object.
unstructured - 0.6.11

Published by yuming-long over 1 year ago

0.6.11

Enhancements

  • Supports epub tests since pandoc is updated in base image

Features

Fixes

unstructured - 0.6.10

Published by cragwolfe over 1 year ago

0.6.10

Enhancements

  • XLS support from auto partition

Features

Fixes

unstructured - 0.6.9

Published by qued over 1 year ago

0.6.9

Enhancements

  • fast strategy for pdf now keeps element bounding box data
  • setup.py refactor

Features

Fixes

  • Adds functionality to try other common encodings if an error related to the encoding is raised and the user has not specified an encoding.
  • Adds additional MIME types for CSV
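
The encoding fallback described above follows a common pattern: honor the caller's encoding if one was given, otherwise try a short list of candidates in order. The list of encodings and the function name below are assumptions made for illustration and are not the library's actual list or API.

```python
# Candidate encodings, tried in order. latin-1 accepts any byte sequence, so it goes last.
COMMON_ENCODINGS = ["utf-8", "cp1252", "latin-1"]


def decode_with_fallback(data: bytes, encoding=None):
    """Return (text, encoding_used); only fall back when no encoding was specified."""
    if encoding is not None:
        # The caller chose an encoding explicitly, so decoding errors are not masked.
        return data.decode(encoding), encoding
    for candidate in COMMON_ENCODINGS:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode data with any of the common encodings")
```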
unstructured - 0.6.8

Published by MthwRobinson over 1 year ago

0.6.8

Enhancements

Features

  • Add partition_csv for CSV files.

Fixes

Package Rankings
Top 1.48% on Pypi.org
Top 3.72% on Proxy.golang.org