unstructured | Langchain Ecosystem Directory


unstructured - 0.5.0

Published by MthwRobinson over 1 year ago

0.5.0

Enhancements

Add requires_dependencies Python decorator to check dependencies are installed before
instantiating a class or running a function
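As a rough illustration of what a dependency-checking decorator of this kind can look like, here is a minimal sketch. It is not the library's implementation: the body, the error message, and the example functions dump and needs_missing are assumptions made for demonstration only.

```python
import functools
import importlib.util
from typing import Callable, List, Union


def requires_dependencies(dependencies: Union[str, List[str]]) -> Callable:
    """Illustrative stand-in: fail early if a top-level package is missing."""
    if isinstance(dependencies, str):
        dependencies = [dependencies]

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # find_spec returns None when a top-level module cannot be located.
            missing = [d for d in dependencies if importlib.util.find_spec(d) is None]
            if missing:
                raise ImportError("Missing required dependencies: " + ", ".join(missing))
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies("json")
def dump(data):
    import json

    return json.dumps(data)


# Hypothetical package name, chosen so the check fails.
@requires_dependencies(["hypothetical_missing_pkg_xyz"])
def needs_missing():
    return "unreachable"
```

The check runs at call time rather than at import time, so a module with optional dependencies can still be imported when they are absent.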

Features

Added Wikipedia connector for ingest cli.

Fixes

Fix process_document file cleaning on failure
Fixes an error introduced in the metadata tracking commit that caused NarrativeText
and FigureCaption elements to be represented as Text in HTML documents.

unstructured - 0.4.16

Published by cragwolfe over 1 year ago

0.4.16

Enhancements

Fallback to using file extensions for filetype detection if libmagic is not present
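The fallback idea can be sketched as follows. This is an assumption-laden illustration, not the library's detection code: the EXTENSION_TO_TYPE table, the function names, and the libmagic_detector parameter are all hypothetical.

```python
import os
from typing import Callable, Optional

# Hypothetical lookup table; the real mapping is not given in these notes.
EXTENSION_TO_TYPE = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".md": "text/markdown",
    ".txt": "text/plain",
}


def detect_filetype_by_extension(filename: str) -> Optional[str]:
    """Guess a MIME type from the file extension alone."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_TYPE.get(ext.lower())


def detect_filetype(
    filename: str,
    libmagic_detector: Optional[Callable[[str], str]] = None,
) -> Optional[str]:
    """Use the content-based detector when available, else the extension."""
    if libmagic_detector is not None:
        return libmagic_detector(filename)
    return detect_filetype_by_extension(filename)
```

Passing the detector in as a parameter keeps the sketch runnable on machines without libmagic; unknown extensions yield None rather than a guess.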

Features

Added setup script for Ubuntu
Added GitHub connector for ingest cli.
Added partition_md partitioner.
Added Reddit connector for ingest cli.

Fixes

Initializes connector properly in ingest.main::MainProcess
Restricts version of unstructured-inference to avoid multithreading issue

unstructured - 0.4.15

Published by MthwRobinson over 1 year ago

0.4.15

Enhancements

Added elements_to_json and elements_from_json for easier serialization/deserialization
convert_to_dict, dict_to_elements and convert_to_csv are now aliases for functions
that use the ISD terminology.

Fixes

Update to ensure all elements are preserved during serialization/deserialization

unstructured - 0.4.14

Published by MthwRobinson over 1 year ago

0.4.14

Automatically install nltk models in the tokenize module.

unstructured - 0.4.13

Published by cragwolfe over 1 year ago

0.4.13

Fixes unstructured-ingest cli.

unstructured - 0.4.12

Published by cragwolfe over 1 year ago

0.4.12

Adds console_entrypoint for unstructured-ingest, along with other structure and doc updates related to ingest.
Add parser parameter to partition_html.

unstructured - 0.4.11

Published by MthwRobinson over 1 year ago

0.4.11

Adds partition_doc for partitioning Word documents in .doc format. Requires libreoffice.
Adds partition_ppt for partitioning PowerPoint documents in .ppt format. Requires libreoffice.

unstructured - 0.4.10

Published by MthwRobinson over 1 year ago

0.4.10

Fixes ElementMetadata so that it's JSON serializable when the filename is a Path object.
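To show why a Path filename breaks JSON serialization and one way to avoid it, here is a small sketch. MetadataSketch is a hypothetical stand-in class, not the library's ElementMetadata, and converting the path to a string in __post_init__ is only one possible fix.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional, Union


@dataclass
class MetadataSketch:
    """Hypothetical stand-in for a metadata record with a filename field."""

    filename: Optional[Union[str, Path]] = None

    def __post_init__(self):
        # json.dumps cannot encode Path objects, so store a plain string.
        if isinstance(self.filename, Path):
            self.filename = str(self.filename)

    def to_dict(self) -> dict:
        return asdict(self)


metadata = MetadataSketch(filename=Path("report.pdf"))
serialized = json.dumps(metadata.to_dict())
```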

unstructured - 0.4.9

Published by MthwRobinson over 1 year ago

0.4.9

Added ingest modules and s3 connector
Default to url=None for partition_pdf and partition_image
Add ability to skip English specific check by setting the UNSTRUCTURED_LANGUAGE env var to "".
Document Element objects now track metadata

unstructured - 0.4.8

Published by MthwRobinson over 1 year ago

0.4.8

Modified XML and HTML parsers not to load comments.

unstructured - 0.4.7

Published by MthwRobinson over 1 year ago

Added the ability to pull an HTML document from a url in partition_html.
Added the ability to get file summary info from lists of filenames and lists
of file contents.
Added optional page break to partition for .pptx, .pdf, images, and .html files.
Added to_dict method to document elements.
Include more unicode quotes in replace_unicode_quotes.

unstructured - 0.4.6

Published by MthwRobinson over 1 year ago

0.4.6

Loosen the default cap threshold to 0.5.
Add a UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD environment variable for controlling
the cap ratio threshold.
Unknown text elements are identified as Text for HTML and plain text documents.
Body Text styles no longer default to NarrativeText for Word documents. The style information
is insufficient to determine that the text is narrative.
Upper cased text is lower cased before checking for verbs. This helps avoid some missed verbs.
Adds an Address element for capturing elements that only contain an address.
Suppress the UserWarning when detectron is called.
Checks that titles and narrative text have at least one English word.
Checks that titles and narrative text are at least 50% alpha characters.
Restricts titles to a maximum word length. Adds a UNSTRUCTURED_TITLE_MAX_WORD_LENGTH
environment variable for controlling the max number of words in a title.
Updated partition_pptx to order the elements on the page
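The capitalization and alpha-character checks in this release can be pictured with the sketch below. It is an approximation under stated assumptions, not the library's code: the exact word-splitting rules, the behavior of the environment variable override, and the helper name under_alpha_ratio are illustrative guesses.

```python
import os


def exceeds_cap_ratio(text: str, threshold: float = 0.5) -> bool:
    """True when the share of capitalized words is above the threshold."""
    # Assumption: the environment variable, when set, overrides the argument.
    threshold = float(
        os.environ.get("UNSTRUCTURED_NARRATIVE_TEXT_CAP_THRESHOLD", threshold)
    )
    words = [w for w in text.split() if w[0].isalpha()]
    if not words:
        return False  # empty text never exceeds the ratio
    capitalized = sum(1 for w in words if w[0].isupper())
    return capitalized / len(words) > threshold


def under_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    """True when fewer than `threshold` of non-space characters are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return False
    alpha = sum(1 for c in chars if c.isalpha())
    return alpha / len(chars) < threshold
```

With the default threshold of 0.5, a heading such as "Quarterly Revenue Report" exceeds the cap ratio and would not be treated as narrative text, while an ordinary sentence with one capitalized word does not.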

unstructured - 0.4.4

Published by MthwRobinson over 1 year ago

0.4.4

Updated partition_pdf and partition_image to return unstructured Element objects
Fixed the healthcheck url path when partitioning images and PDFs via API
Adds an optional coordinates attribute to document objects
Adds FigureCaption and CheckBox document elements
Added ability to split lists detected in LayoutElement objects
Adds partition_pptx for partitioning PowerPoint documents
LayoutParser models now download from the Hugging Face Hub instead of Dropbox
Fixed file type detection for XML and HTML files on Amazon Linux

unstructured - 0.4.3

Published by MthwRobinson almost 2 years ago

0.4.3

Adds requests as a base dependency
Fix in exceeds_cap_ratio so the function doesn't break with empty text
Fix bug in _parse_received_data.
Update detect_filetype to properly handle .doc, .xls, and .ppt.

unstructured - 0.4.2

Published by MthwRobinson almost 2 years ago

0.4.2

Added partition_image to process documents in an image format.
Fixed utf-8 encoding error in partition_email with attachments for text/html

unstructured - 0.4.1

Published by MthwRobinson almost 2 years ago

0.4.1

Added support for text files in the partition function
Pinned opencv-python for easier installation on Linux

unstructured - 0.4.0

Published by MthwRobinson almost 2 years ago

0.4.0

Added generic partition brick that detects the file type and routes a file to the appropriate
partitioning brick.
Added a file type detection module.
Updated partition_html and partition_eml to support file-like objects in 'rb' mode.
Added cleaning brick for removing ordered bullets, clean_ordered_bullets.
Added extract brick method for ordered bullets, extract_ordered_bullets.
Added test for clean_ordered_bullets.
Added test for extract_ordered_bullets.
Added partition_docx for pre-processing Word Documents.
Added new REGEX patterns to extract email header information
Added new functions to extract header information parse_received_data and partition_header
Added new function to parse plain text files partition_text
Added new cleaner functions extract_ip_address, extract_ip_address_name, extract_mapi_id, extract_datetimetz
Add new Image element and function to find embedded images find_embedded_images
Added get_directory_file_info for summarizing information about source documents
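To give a sense of what cleaning ordered bullets involves, here is a minimal regex-based sketch. The function name strip_ordered_bullet and the pattern are assumptions for illustration; they are not the library's clean_ordered_bullets and may treat edge cases differently.

```python
import re

# Leading bullet such as "1. ", "1.1 ", "a. " or "1.2.3 " (illustrative pattern).
ORDERED_BULLET_RE = re.compile(r"^\s*(?:[A-Za-z0-9]{1,2}\.)+(?:[A-Za-z0-9]{1,2})?\s+")


def strip_ordered_bullet(text: str) -> str:
    """Remove one leading ordered bullet; leave other text unchanged."""
    return ORDERED_BULLET_RE.sub("", text, count=1)
```

A pattern this simple also strips short abbreviations that end in a period at the start of a line, so it is only a starting point.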

unstructured - 0.3.5

Published by qued almost 2 years ago

0.3.5

Add support for local inference
Add new pattern to recognize plain text dash bullets
Add test for bullet patterns
Fix for partition_html that allows for processing div tags that have both text and child elements
Add ability to extract document metadata from .docx, .xlsx, and .jpg files.
Helper functions for identifying and extracting phone numbers
Add new function extract_attachment_info that extracts and decodes the attachments of an email.
Staging brick to convert a list of Elements to a pandas dataframe.