unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

APACHE-2.0 License

Downloads
1.6M
Stars
5.8K
Committers
110

unstructured - 0.6.7

Published by MthwRobinson over 1 year ago

0.6.7

Enhancements

  • Deprecate --s3-url in favor of --remote-url in CLI
  • Refactor out non-connector-specific config variables
  • Add file_directory to metadata
  • Add page_name to metadata. Currently used for the sheet name in XLSX documents.
  • Added a --partition-strategy parameter to unstructured-ingest so that users can specify
    partition strategy in CLI. For example, --partition-strategy fast.
  • Added metadata for filetype.
  • Add Discord connector to pull messages from a list of channels
  • Refactor unstructured/file-utils/filetype.py to better utilise hashmap to return mime type.
  • Add local declaration of DOCX_MIME_TYPES and XLSX_MIME_TYPES for test_filetype.py.
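
As a rough illustration of the hashmap-based MIME type lookup mentioned above, the sketch below maps file extensions to MIME types with a plain dictionary. The table contents and the name mime_type_for are assumptions made for this example; they are not taken from the library's filetype module.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table for illustration; the library's real mapping may differ.
EXT_TO_MIME = {
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".xml": "application/xml",
    ".html": "text/html",
}


def mime_type_for(filename: str) -> Optional[str]:
    """Return the MIME type for a filename, or None if the extension is unknown."""
    # Lower-case the suffix so "Report.XLSX" and "report.xlsx" resolve identically.
    return EXT_TO_MIME.get(Path(filename).suffix.lower())
```

A dictionary lookup keeps the mapping in one place and avoids a long chain of if/elif comparisons.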

Features

  • Add partition_xml for XML files.
  • Add partition_xlsx for Microsoft Excel documents.

Fixes

  • Supports hml filetype for partition as a variation of html filetype.
  • Makes pytesseract a function level import in partition_pdf so you can use the "fast"
    or "hi_res" strategies if pytesseract is not installed. Also adds the
    requires_dependencies decorator for the "hi_res" and "ocr_only" strategies.
  • Fix to ensure filename is tracked in metadata for docx tables.
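
The function-level import and dependency-check decorator described above can be sketched roughly as follows. This is a minimal stand-in written for this example, not the library's own decorator; the helper functions uses_json and needs_missing, and the module name surely_not_installed_pkg_xyz, are hypothetical.

```python
import functools
import importlib.util
from typing import Callable, Sequence


def requires_dependencies(dependencies: Sequence[str]) -> Callable:
    """Raise ImportError at call time if any named top-level module is not importable."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [name for name in dependencies if importlib.util.find_spec(name) is None]
            if missing:
                raise ImportError(f"Missing optional dependencies: {', '.join(missing)}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies(["json"])
def uses_json(payload: str) -> dict:
    import json  # function-level import: loaded only when this function runs

    return json.loads(payload)


@requires_dependencies(["surely_not_installed_pkg_xyz"])  # hypothetical missing module
def needs_missing() -> str:
    return "unreachable"
```

Because the check runs when the function is called rather than when the module is imported, code paths that never touch the optional dependency keep working without it.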
unstructured - 0.6.6

Published by MthwRobinson over 1 year ago

0.6.6

Enhancements

  • Adds an "auto" strategy that chooses the partitioning strategy based on document
    characteristics and function kwargs. This is the new default strategy for partition_pdf
    and partition_image. Users can maintain existing behavior by explicitly setting
    strategy="hi_res".
  • Added an additional trace logger for NLP debugging.
  • Add get_date method to ElementMetadata for converting the datestring to a datetime object.
  • Clean up the filename attribute on ElementMetadata to remove the full filepath.
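
To illustrate the datestring-to-datetime conversion mentioned above, here is a small sketch. It assumes the stored date is an ISO 8601 string; the standalone function below is a stand-in for this example and is not the ElementMetadata method itself.

```python
from datetime import datetime
from typing import Optional


def get_date(date_string: Optional[str]) -> Optional[datetime]:
    """Convert an ISO 8601 date string to a datetime, passing None through."""
    if date_string is None:
        return None
    # Assumption: the metadata stores dates in ISO 8601 form.
    return datetime.fromisoformat(date_string)
```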

Features

  • Added reading of tables as HTML, with URL parsing, to partition_docx
  • Added metadata field for text_as_html for docx files

Fixes

  • The file type check in file_utils now ignores decode errors when checking for JSON and EML files
  • partition_email was updated to more flexibly handle deviations from the RFC-2822 standard.
    The time in the metadata returns None if the time does not match RFC-2822 at all.
  • Include all metadata fields when converting to dataframe or CSV
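
The tolerant date handling described for partition_email can be approximated with the standard library, as in the sketch below. The function name parse_email_date is hypothetical, and this is not the library's parsing code; it only shows the "return None when the value does not match RFC-2822" behavior.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_email_date(value: str) -> Optional[datetime]:
    """Parse an RFC-2822 date header, returning None if it cannot be parsed."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError and newer ones raise ValueError
        # for unparseable input, so both are caught here.
        return None
```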
unstructured - 0.6.5

Published by ryannikolaidis over 1 year ago

0.6.5

Enhancements

  • Added support for SpooledTemporaryFile file argument.
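
For context, a SpooledTemporaryFile is a file-like buffer that stays in memory until it exceeds a size limit, which is a common shape for uploaded files. The sketch below shows reading one like any other binary file object; read_text is a hypothetical helper for this example and not part of the library.

```python
from tempfile import SpooledTemporaryFile
from typing import IO


def read_text(file: IO[bytes], encoding: str = "utf-8") -> str:
    """Read the full contents of a binary file-like object as text."""
    file.seek(0)  # rewind in case the caller has just written to the buffer
    return file.read().decode(encoding)


with SpooledTemporaryFile(max_size=1024) as spooled:
    spooled.write(b"Hello from an upload buffer.")
    text = read_text(spooled)
```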

Features

Fixes

unstructured - 0.6.4

Published by MthwRobinson over 1 year ago

0.6.4

Enhancements

  • Added an "ocr_only" strategy for partition_pdf. Refactored the strategy decision
    logic into its own module.

Features

Fixes

unstructured - 0.6.3

Published by MthwRobinson over 1 year ago

0.6.3

Enhancements

  • Add an "ocr_only" strategy for partition_image.

Features

  • Added partition_multiple_via_api for partitioning multiple documents in a single REST
    API call.
  • Added stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
  • Added partition_odt for processing Open Office documents.

Fixes

  • Updates the grouping logic in the partition_pdf fast strategy to group together text
    in the same bounding box.
unstructured - 0.6.2

Published by MthwRobinson over 1 year ago

0.6.2

Enhancements

  • Added logic to partition_pdf for detecting copy-protected PDFs and falling back
    to the "hi_res" strategy when necessary.

Features

  • Add partition_via_api for partitioning documents through the hosted API.

Fixes

  • Fix how exceeds_cap_ratio handles empty input (returns True instead of False)
  • Updates detect_filetype to properly detect JSONs when the MIME type is text/plain.
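
One way to picture the JSON detection fix above: when the reported MIME type is text/plain, try parsing the content as JSON before accepting the generic type. The helper names below are hypothetical and the logic is a simplified sketch, not the detect_filetype implementation.

```python
import json


def is_json_text(text: str) -> bool:
    """Return True if text parses as a JSON object or array."""
    try:
        parsed = json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    # Bare scalars such as 42 are valid JSON but are treated as plain text here.
    return isinstance(parsed, (dict, list))


def refine_mime_type(mime_type: str, text: str) -> str:
    """Upgrade text/plain to application/json when the content is JSON."""
    if mime_type == "text/plain" and is_json_text(text):
        return "application/json"
    return mime_type
```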
unstructured - 0.6.1

Published by qued over 1 year ago

0.6.1

Enhancements

  • Updated the table extraction parameter name to be more descriptive

Features

Fixes

unstructured - 0.6.0

Published by qued over 1 year ago

0.6.0

Enhancements

  • Adds an ssl_verify kwarg to partition and partition_html to enable turning off
    SSL verification for HTTP requests. SSL verification is on by default.
  • Allows users to pass in ocr language to partition_pdf and partition_image through
    the ocr_language kwarg. ocr_language corresponds to the code for the language pack
    in Tesseract. You will need to install the relevant Tesseract language pack to use a
    given language.

Features

  • Table extraction is now possible for pdfs from partition and partition_pdf.
  • Adds support for extracting attachments from .msg files

Fixes

unstructured - 0.5.13

Published by MthwRobinson over 1 year ago

0.5.13

Enhancements

  • Allow headers to be passed into partition when url is used.

Features

  • Add bytes_string_to_string cleaning brick for bytes string output.

Fixes

  • Fixed typo in call to exactly_one in partition_json
  • unstructured-documents: encode the XML string if document_tree is None in _read_xml.
  • Update to _read_xml so that Markdown files with embedded HTML process correctly.
  • Fallback to "fast" strategy only emits a warning if the user specifies the "hi_res" strategy.
  • unstructured-partition-text_type: fix what exceeds_cap_ratio returns and how capitalization ratios are calculated.
  • partition_pdf and partition_text group broken paragraphs to avoid fragmented NarrativeText elements.
  • .json files now resolve as "application/json" on CentOS 7 (or other installs with older libmagic libs)
unstructured - 0.5.12

Published by MthwRobinson over 1 year ago

0.5.12

Enhancements

  • Add OS mimetypes DB to docker image, mainly for unstructured-api compat.
  • Use the image registry as a cache when building Docker images.
  • Adds the ability for partition_text to group together broken paragraphs.
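
Grouping broken paragraphs, as mentioned above, can be sketched with two regular expressions: blank lines separate paragraphs, and single line breaks inside a paragraph are joined with spaces. This is a simplified illustration under that assumption; the library's own rules for what counts as a paragraph break may differ.

```python
import re

PARAGRAPH_BREAK = re.compile(r"\n\s*\n")  # one or more blank lines
LINE_BREAK = re.compile(r"\s*\n\s*")  # a single hard wrap inside a paragraph


def group_broken_paragraphs(text: str) -> str:
    """Join hard-wrapped lines so each paragraph becomes a single line."""
    paragraphs = PARAGRAPH_BREAK.split(text)
    cleaned = [LINE_BREAK.sub(" ", p.strip()) for p in paragraphs if p.strip()]
    return "\n\n".join(cleaned)
```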

Features

  • Add --partition-by-api parameter to unstructured-ingest
  • Added partition_rtf for processing rich text files.
  • partition now accepts a url kwarg in addition to file and filename.

Fixes

  • Allow encoding to be passed into replace_mime_encodings.
  • unstructured-ingest connector-specific dependencies are imported on demand.
  • unstructured-ingest --flatten-metadata supported for local connector.
  • unstructured-ingest fix runtime error when using --metadata-include.
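
To show why an encoding argument matters when replacing MIME encodings, the sketch below decodes quoted-printable text with the standard quopri module. It is an assumption-laden stand-in for the cleaning function named above, not a copy of it.

```python
import quopri


def replace_mime_encodings(text: str, encoding: str = "utf-8") -> str:
    """Decode quoted-printable escapes such as =E2=80=99 using the given encoding."""
    # The same encoding is used to turn the text into bytes and to decode the result.
    return quopri.decodestring(text.encode(encoding)).decode(encoding)
```

The same escape sequence decodes differently depending on the character set, so text from a Latin-1 email needs encoding="latin-1" rather than the UTF-8 default.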
unstructured - 0.5.11

Published by MthwRobinson over 1 year ago

0.5.11

Enhancements

Features

Fixes

  • Guard against null style attribute in docx document elements
  • Update HTML encoding to better support foreign language characters
unstructured - 0.5.10

Published by ryannikolaidis over 1 year ago

0.5.10

Enhancements

  • Updated inference package
  • Add sender, recipient, date, and subject to element metadata for emails

Features

  • Added --download-only parameter to unstructured-ingest

Fixes

  • Fix FileNotFound error when filename is provided but the file is not on disk
unstructured - 0.5.9

Published by amanda103 over 1 year ago

0.5.9

Enhancements

Features

Fixes

  • Convert file to str in helper split_by_paragraph for partition_text
unstructured - 0.5.8

Published by MthwRobinson over 1 year ago

0.5.8

Enhancements

  • Update elements_to_json to return string when filename is not specified
  • elements_from_json may take a string instead of a filename with the text kwarg
  • detect_filetype now does a final fallback to file extension.
  • Empty tags are now skipped during the depth check for HTML processing.

Features

  • Add local file system to unstructured-ingest
  • Add --max-docs parameter to unstructured-ingest
  • Added partition_msg for processing Microsoft Outlook .msg files.

Fixes

  • convert_file_to_text now passes through the source_format and target_format kwargs.
    Previously they were hard coded.
  • Partitioning functions that accept a text kwarg no longer raise an error if an empty
    string is passed (an empty list of elements is returned instead).
  • partition_json no longer fails if the input is an empty list.
  • Fixed bug in chunk_by_attention_window that caused the last word in segments to be cut off
    in some cases.

BREAKING CHANGES

  • stage_for_transformers now returns a list of elements, making it consistent with other
    staging bricks
unstructured - 0.5.7

Published by amanda103 over 1 year ago

0.5.7

Enhancements

  • Refactored codebase using exactly_one
  • Adds ability to pass headers when passing a url in partition_html()
  • Added optional content_type and file_filename parameters to partition() to bypass file detection
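
A helper like exactly_one typically validates that a caller supplied one, and only one, of several mutually exclusive arguments (for example filename, file, or text). The sketch below is a minimal version written for illustration; treating only None as "not provided" is an assumption and may not match the library's rule.

```python
from typing import Any


def exactly_one(**kwargs: Any) -> None:
    """Raise ValueError unless exactly one keyword argument is not None."""
    provided = [name for name, value in kwargs.items() if value is not None]
    if len(provided) != 1:
        names = ", ".join(kwargs)
        raise ValueError(f"Exactly one of {names} must be specified.")
```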

Features

  • Add --flatten-metadata parameter to unstructured-ingest
  • Add --fields-include parameter to unstructured-ingest
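
As a rough picture of what flattening metadata could mean for an element's output, the sketch below lifts the keys of a nested metadata dictionary to the top level. The element layout and the function name are assumptions for this example; the actual output of --flatten-metadata is not specified in these notes.

```python
from typing import Any, Dict


def flatten_metadata(element: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of element with its "metadata" keys moved to the top level."""
    flat = {key: value for key, value in element.items() if key != "metadata"}
    # Metadata keys overwrite top-level keys of the same name in this sketch.
    flat.update(element.get("metadata", {}))
    return flat
```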

Fixes

unstructured - 0.5.6

Published by cragwolfe over 1 year ago

0.5.6

  • Fix problem with PDF partition (duplicated test)

Enhancements

  • contains_english_word(), used heavily in text processing, is 10x faster.

Features

  • Add --metadata-include and --metadata-exclude parameters to unstructured-ingest
  • Add clean_non_ascii_chars to remove non-ascii characters from unicode string
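
Removing non-ASCII characters from a unicode string can be done with an encode/decode round trip, as in this sketch. It illustrates the technique named above and is not claimed to be the library's exact implementation.

```python
def clean_non_ascii_chars(text: str) -> str:
    """Drop every character that cannot be represented in ASCII."""
    # errors="ignore" silently discards characters outside the ASCII range.
    return text.encode("ascii", "ignore").decode("ascii")
```

Note that characters are dropped rather than transliterated, so accented letters disappear instead of becoming their unaccented forms.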

Fixes

  • Fixes duplicated elements issue with partition_pdf(..., strategy="fast")
unstructured - 0.5.4

Published by MthwRobinson over 1 year ago

0.5.4

Enhancements

  • Added Biomedical literature connector for ingest cli.
  • Add FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
  • Rename s3_connector.py to s3.py for readability and consistency with the
    rest of the connectors.
  • Now S3Connector relies on s3fs instead of on boto3, and it inherits
    from FsspecConnector.
  • Adds an UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language
    specific checks like vocabulary and POS tagging are applied. Set to "true" for higher
    resolution partitioning and "false" for faster processing.
  • Improves detect_filetype warning to include filename when provided.
  • Adds a "fast" strategy for partitioning PDFs with PDFMiner. Also falls back to the "fast"
    strategy if detectron2 is not available.
  • Start deprecation life cycle for unstructured-ingest --s3-url option, to be deprecated in
    favor of --remote-url.

Features

  • Add AzureBlobStorageConnector based on its fsspec implementation inheriting
    from FsspecConnector
  • Add partition_epub for partitioning e-books in EPUB3 format.

Fixes

  • Fixes processing for text files with message/rfc822 MIME type.
  • Open xml files in read-only mode when reading contents to construct an XMLDocument.
unstructured - 0.5.3

Published by qued over 1 year ago

0.5.3

Enhancements

  • auto.partition() can now load Unstructured ISD json documents.
  • Simplify partitioning functions.
  • Improve logging for ingest CLI.

Features

  • Add --wikipedia-auto-suggest argument to the ingest CLI to disable automatic redirection
    to pages with similar names.
  • Add setup script for Amazon Linux 2
  • Add optional encoding argument to the partition_(text/email/html) functions.
  • Added Google Drive connector for ingest cli.
  • Added Gitlab connector for ingest cli.

Fixes

unstructured - 0.5.2

Published by MthwRobinson over 1 year ago

0.5.2

Enhancements

  • unstructured-ingest now uses a default --download-dir of $HOME/.cache/unstructured/ingest
    rather than a "tmp-ingest-" dir in the working directory.

Features

Fixes

  • setup_ubuntu.sh no longer fails in some contexts by interpreting
    DEBIAN_FRONTEND=noninteractive as a command
  • unstructured-ingest no longer re-downloads files when --preserve-downloads
    is used without --download-dir.
  • Fixed an issue that was causing text to be skipped in some HTML documents.
unstructured - 0.5.1

Published by cragwolfe over 1 year ago

0.5.1

Enhancements

Features

Fixes

  • Fixes an error causing JavaScript to appear in the output of partition_html sometimes.
  • Fix several issues with the requires_dependencies decorator, including the error message
    and how it was used, which had caused an error for unstructured-ingest --github-url ....
Package Rankings
Top 1.48% on Pypi.org
Top 3.72% on Proxy.golang.org