unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

APACHE-2.0 License

Downloads: 1.6M
Stars: 5.8K
Committers: 110


unstructured - 0.5.0

Published by MthwRobinson over 1 year ago

0.5.0

Enhancements

  • Add requires_dependencies Python decorator to check dependencies are installed before
    instantiating a class or running a function
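
A dependency-checking decorator of this kind can be sketched with the standard library alone. The code below is an illustration, not the library's implementation: the body of requires_dependencies and the two decorated functions are assumptions, and the check only handles top-level module names.

```python
import functools
import importlib.util


def requires_dependencies(*dependencies):
    """Raise ImportError at call time if a named top-level module is missing."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [
                name for name in dependencies
                if importlib.util.find_spec(name) is None
            ]
            if missing:
                raise ImportError(
                    f"{func.__name__} requires missing dependencies: "
                    + ", ".join(missing)
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies("json")
def dump(obj):
    import json

    return json.dumps(obj)


# Hypothetical module name, chosen so that it is not installed anywhere.
@requires_dependencies("surely_not_an_installed_module_xyz")
def needs_missing():
    return "unreachable"
```

Checking inside the wrapper rather than at decoration time keeps the module importable when an optional extra is absent; the error only appears when the guarded function is actually called.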

Features

  • Added Wikipedia connector for ingest cli.

Fixes

  • Fix process_document file cleaning on failure
  • Fixes an error introduced in the metadata tracking commit that caused NarrativeText
    and FigureCaption elements to be represented as Text in HTML documents.
unstructured - 0.4.16

Published by cragwolfe over 1 year ago

0.4.16

Enhancements

  • Fallback to using file extensions for filetype detection if libmagic is not present
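
The fallback idea can be illustrated as follows. This is a minimal sketch under stated assumptions, not the library's detection code: detect_mime_type and guess_from_extension are hypothetical names, and the optional import assumes the python-magic package, which raises ImportError when libmagic cannot be found.

```python
import mimetypes
from typing import Optional

try:
    import magic  # python-magic, a wrapper around libmagic (optional)
except ImportError:
    magic = None


def guess_from_extension(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension only."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed


def detect_mime_type(filename: str) -> Optional[str]:
    """Prefer content sniffing via libmagic, else fall back to the extension."""
    if magic is not None:
        return magic.from_file(filename, mime=True)
    return guess_from_extension(filename)
```

The extension-based guess is less reliable than content sniffing because it trusts the filename, so it is used here only when libmagic is unavailable.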

Features

  • Added setup script for Ubuntu
  • Added GitHub connector for ingest cli.
  • Added partition_md partitioner.
  • Added Reddit connector for ingest cli.

Fixes

  • Initializes connector properly in ingest.main::MainProcess
  • Restricts version of unstructured-inference to avoid multithreading issue
unstructured - 0.4.15

Published by MthwRobinson over 1 year ago

0.4.15

Enhancements

  • Added elements_to_json and elements_from_json for easier serialization/deserialization
  • convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions
    that use the ISD terminology.
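
A JSON round trip of the kind these helpers provide might look like the sketch below. SketchElement and the two functions are stand-ins invented for illustration; the real element classes and the signatures of elements_to_json and elements_from_json may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class SketchElement:
    """Hypothetical stand-in for a document element."""

    type: str
    text: str


def to_json_sketch(elements: List[SketchElement]) -> str:
    """Serialize elements to a JSON array of dictionaries."""
    return json.dumps([asdict(element) for element in elements], indent=2)


def from_json_sketch(payload: str) -> List[SketchElement]:
    """Rebuild elements from the JSON produced by to_json_sketch."""
    return [SketchElement(**item) for item in json.loads(payload)]
```

Storing the element type alongside the text is what allows deserialization to rebuild each element without losing information.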

Fixes

  • Update to ensure all elements are preserved during serialization/deserialization
unstructured - 0.4.14

Published by MthwRobinson over 1 year ago

0.4.14

  • Automatically install nltk models in the tokenize module.
unstructured - 0.4.13

Published by cragwolfe over 1 year ago

0.4.13

  • Fixes unstructured-ingest cli.
unstructured - 0.4.12

Published by cragwolfe over 1 year ago

0.4.12

  • Adds console_entrypoint for unstructured-ingest, other structure/doc updates related to ingest.
  • Add parser parameter to partition_html.
unstructured - 0.4.11

Published by MthwRobinson over 1 year ago

0.4.11

  • Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.
  • Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.
unstructured - 0.4.10

Published by MthwRobinson over 1 year ago

0.4.10

  • Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.
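
The underlying issue is that json.dumps rejects pathlib.Path values. A minimal sketch of the kind of conversion that avoids it is shown below; metadata_to_dict is a hypothetical helper, not the library's ElementMetadata code.

```python
import json
from pathlib import Path
from typing import Dict, Optional, Union


def metadata_to_dict(filename: Optional[Union[str, Path]]) -> Dict[str, Optional[str]]:
    """Convert a Path filename to str so the result is JSON serializable."""
    return {"filename": None if filename is None else str(filename)}


# Passing the Path through unchanged fails with TypeError.
try:
    json.dumps({"filename": Path("example.html")})
    path_is_serializable = True
except TypeError:
    path_is_serializable = False

serialized = json.dumps(metadata_to_dict(Path("example.html")))
```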
unstructured - 0.4.9

Published by MthwRobinson over 1 year ago

0.4.9

  • Added ingest modules and s3 connector
  • Default to url=None for partition_pdf and partition_image
  • Add ability to skip English specific check by setting the UNSTRUCTURED_LANGUAGE env var to "".
  • Document Element objects now track metadata
unstructured - 0.4.8

Published by MthwRobinson over 1 year ago

0.4.8

  • Modified XML and HTML parsers not to load comments.
unstructured - 0.4.7

Published by MthwRobinson over 1 year ago

0.4.7

  • Added the ability to pull an HTML document from a url in partition_html.
  • Added the ability to get file summary info from lists of filenames and lists
    of file contents.
  • Added optional page break to partition for .pptx, .pdf, images, and .html files.
  • Added to_dict method to document elements.
  • Include more unicode quotes in replace_unicode_quotes.
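
Quote normalization of this sort can be sketched with a translation table. The mapping below is an assumption that covers only a handful of characters; it is not the set handled by replace_unicode_quotes, and normalize_quotes is a hypothetical name.

```python
# Map a few typographic quotes to their ASCII equivalents (illustrative subset).
QUOTE_MAP = str.maketrans(
    {
        "\u2018": "'",  # left single quotation mark
        "\u2019": "'",  # right single quotation mark
        "\u201b": "'",  # single high-reversed-9 quotation mark
        "\u201c": '"',  # left double quotation mark
        "\u201d": '"',  # right double quotation mark
        "\u201f": '"',  # double high-reversed-9 quotation mark
    }
)


def normalize_quotes(text: str) -> str:
    """Replace the typographic quotes listed in QUOTE_MAP with ASCII quotes."""
    return text.translate(QUOTE_MAP)
```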
unstructured - 0.4.6

Published by MthwRobinson over 1 year ago

0.4.6

  • Loosen the default cap threshold to 0.5.
  • Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling
    the cap ratio threshold.
  • Unknown text elements are identified as Text for HTML and plain text documents.
  • Body Text styles no longer default to NarrativeText for Word documents. The style information
    is insufficient to determine that the text is narrative.
  • Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
  • Adds an Address element for capturing elements that only contain an address.
  • Suppress the UserWarning when detectron is called.
  • Checks that titles and narrative text have at least one English word.
  • Checks that titles and narrative text are at least 50% alpha characters.
  • Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH
    environment variable for controlling the max number of words in a title.
  • Updated partition_pptx to order the elements on the page
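
Two of the heuristics above, the capitalization ratio and the alpha-character ratio, can be sketched as below. The function names, the exact ratio definitions, and the empty-text behavior are assumptions for illustration; only the environment variable name and the 0.5 defaults come from these notes.

```python
import os
from typing import Optional


def exceeds_cap_ratio_sketch(text: str, threshold: Optional[float] = None) -> bool:
    """Return True if the share of capitalized words is above the threshold."""
    if threshold is None:
        threshold = float(
            os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", "0.5")
        )
    words = text.split()
    if not words:
        # Empty text has no words to measure, so it cannot exceed the ratio.
        return False
    capitalized = sum(1 for word in words if word[0].isupper())
    return capitalized / len(words) > threshold


def under_alpha_ratio_sketch(text: str, threshold: float = 0.5) -> bool:
    """Return True if fewer than `threshold` of non-space chars are letters."""
    chars = [char for char in text if not char.isspace()]
    if not chars:
        # Assumption: empty text is not flagged by this check.
        return False
    alpha = sum(1 for char in chars if char.isalpha())
    return alpha / len(chars) < threshold
```

Text such as "Annual Report Of The Board" is capitalized on every word and would be rejected as narrative text at a 0.5 threshold, while an ordinary sentence with one leading capital would pass.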
unstructured - 0.4.4

Published by MthwRobinson over 1 year ago

0.4.4

  • Updated partition_pdf and partition_image to return unstructured Element objects
  • Fixed the healthcheck url path when partitioning images and PDFs via API
  • Adds an optional coordinates attribute to document objects
  • Adds FigureCaption and CheckBox document elements
  • Added ability to split lists detected in LayoutElement objects
  • Adds partition_pptx for partitioning PowerPoint documents
  • LayoutParser models now download from the Hugging Face Hub instead of Dropbox
  • Fixed file type detection for XML and HTML files on Amazon Linux
unstructured - 0.4.3

Published by MthwRobinson almost 2 years ago

0.4.3

  • Adds requests as a base dependency
  • Fix in exceeds_cap_ratio so the function doesn't break with empty text
  • Fix bug in _parse_received_data.
  • Update detect_filetype to properly handle .doc, .xls, and .ppt.
unstructured - 0.4.2

Published by MthwRobinson almost 2 years ago

0.4.2

  • Added partition_image to process documents in an image format.
  • Fixed utf-8 encoding error in partition_email with attachments for text/html
unstructured - 0.4.1

Published by MthwRobinson almost 2 years ago

0.4.1

  • Added support for text files in the partition function
  • Pinned opencv-python for easier installation on Linux
unstructured - 0.4.0

Published by MthwRobinson almost 2 years ago

0.4.0

  • Added generic partition brick that detects the file type and routes a file to the appropriate
    partitioning brick.
  • Added a file type detection module.
  • Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
  • Cleaning brick for removing ordered bullets clean_ordered_bullets.
  • Extract brick method for ordered bullets extract_ordered_bullets.
  • Test for clean_ordered_bullets.
  • Test for extract_ordered_bullets.
  • Added partition_docx for pre-processing Word Documents.
  • Added new REGEX patterns to extract email header information
  • Added new functions to extract header information parse_received_data and partition_header
  • Added new function to parse plain text files partition_text
  • Added new cleaner functions extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
  • Add new Image element and function to find embedded images find_embedded_images
  • Added get_directory_file_info for summarizing information about source documents
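
The detect-then-route pattern behind a generic partition brick can be sketched with a dispatch table. Everything here is hypothetical: the handlers, the HANDLERS table, and route_by_extension are illustrations, and this sketch keys on the file extension only, whereas the notes describe a dedicated file type detection module.

```python
from pathlib import Path
from typing import Callable, Dict, List


def split_paragraphs(raw: str) -> List[str]:
    """Hypothetical handler: one element per blank-line-separated paragraph."""
    return [block.strip() for block in raw.split("\n\n") if block.strip()]


def split_lines(raw: str) -> List[str]:
    """Hypothetical handler: one element per non-empty line."""
    return [line.strip() for line in raw.splitlines() if line.strip()]


# Dispatch table from detected type (here, the extension) to a handler.
HANDLERS: Dict[str, Callable[[str], List[str]]] = {
    ".txt": split_paragraphs,
    ".log": split_lines,
}


def route_by_extension(filename: str, raw: str) -> List[str]:
    """Pick a handler from the file extension and apply it to the content."""
    suffix = Path(filename).suffix.lower()
    handler = HANDLERS.get(suffix)
    if handler is None:
        raise ValueError(f"Unsupported file type: {suffix or '(none)'}")
    return handler(raw)
```

A table like this keeps the routing logic separate from the handlers, so supporting a new format means adding one entry rather than another branch.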
unstructured - 0.3.5

Published by qued almost 2 years ago

0.3.5

  • Add support for local inference
  • Add new pattern to recognize plain text dash bullets
  • Add test for bullet patterns
  • Fix for partition_html that allows for processing div tags that have both text and child elements
  • Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
  • Helper functions for identifying and extracting phone numbers
  • Add new function extract_attachment_info that extracts and decodes the attachments of an email.
  • Staging brick to convert a list of Elements to a pandas dataframe.
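
A plain-text dash bullet pattern might look like the sketch below. The regular expression and both helper names are assumptions, not the pattern added in this release; it requires whitespace after the dash so that negative numbers and hyphenated words are not treated as bullets.

```python
import re

# Optional leading whitespace, a hyphen or en/em dash, then at least one space.
DASH_BULLET = re.compile(r"^\s*[-\u2013\u2014]\s+")


def is_dash_bullet(text: str) -> bool:
    """Return True if the line starts with a dash-style bullet marker."""
    return DASH_BULLET.match(text) is not None


def strip_dash_bullet(text: str) -> str:
    """Remove a leading dash bullet marker, if present."""
    return DASH_BULLET.sub("", text, count=1)
```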
unstructured - 0.3.4

Published by MthwRobinson almost 2 years ago

0.3.4

  • Python 3.7 compatibility
unstructured - 0.3.3

Published by yuming-long almost 2 years ago

0.3.3

  • Removes BasicConfig from logger configuration
  • Adds the partition_email partitioning brick
  • Adds the replace_mime_encodings cleaning bricks
  • Small fix to HTML parsing related to processing list items with sub-tags
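
Email bodies often arrive quoted-printable encoded, which is the kind of residue a MIME-encoding cleaner deals with. The sketch below uses the standard library's quopri module; decode_quoted_printable is a hypothetical name, and UTF-8 is assumed as the character set.

```python
import quopri


def decode_quoted_printable(text: str, encoding: str = "utf-8") -> str:
    """Decode =XX escapes and soft line breaks in quoted-printable text."""
    return quopri.decodestring(text.encode(encoding)).decode(encoding)
```

For example, the escape sequence =E2=80=99 is the UTF-8 encoding of a right single quotation mark, so "5 w=E2=80=99s" decodes to "5 w’s".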
Package Rankings
Top 1.48% on Pypi.org
Top 3.72% on Proxy.golang.org