Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
APACHE-2.0 License
Published by MthwRobinson over 1 year ago
--s3-url in favor of --remote-url in CLI
file_directory to metadata
page_name to metadata. Currently used for the sheet name in XLSX documents.
--partition-strategy parameter to unstructured-ingest so that users can specify
--partition-strategy fast.
unstructured/file-utils/filetype.py to better utilise hashmap to return mime type.
test_filetype.py.
partition_xml for XML files.
partition_xlsx for Microsoft Excel documents.
hml filetype for partition as a variation of html filetype.
pytesseract a function level import in partition_pdf so you can use the "fast" or "hi_res" strategies if pytesseract is not installed. Also adds the required_dependencies decorator for the "hi_res" and "ocr_only" strategies.
filename is tracked in metadata for docx tables.
Published by MthwRobinson over 1 year ago
"auto" strategy that chooses the partitioning strategy based on document
partition_pdf
partition_image. Users can maintain existing behavior by explicitly setting strategy="hi_res".
get_date method to ElementMetadata for converting the datestring to a datetime object.
filename attribute on ElementMetadata to remove the full filepath.
partition_docx in docx
fileutils/file_type check json and eml decode ignore error
partition_email was updated to more flexibly handle deviations from the RFC-2822 standard.
None if the time does not match RFC-2822 at all.
Published by ryannikolaidis over 1 year ago
Published by MthwRobinson over 1 year ago
partition_pdf. Refactored the strategy decision
Published by MthwRobinson over 1 year ago
partition_image.
partition_multiple_via_api for partitioning multiple documents in a single REST
stage_for_baseplate function to prepare outputs for ingestion into Baseplate.
partition_odt for processing Open Office documents.
partition_pdf fast strategy to group together text
Published by MthwRobinson over 1 year ago
partition_pdf for detecting copy protected PDFs and falling back
partition_via_api for partitioning documents through the hosted API.
exceeds_cap_ratio handles empty (returns True instead of False)
detect_filetype to properly detect JSONs when the MIME type is text/plain.
Published by qued over 1 year ago
Published by qued over 1 year ago
ssl_verify kwarg to partition and partition_html to enable turning off
partition_pdf and partition_image through ocr_language kwarg. ocr_language corresponds to the code for the language pack
partition and partition_pdf.
.msg files
Published by MthwRobinson over 1 year ago
partition when url is used.
bytes_string_to_string cleaning brick for bytes string output.
exactly_one in partition_json
None in _read_xml.
_read_xml so that Markdown files with embedded HTML process correctly.
partition_pdf and partition_text group broken paragraphs to avoid fragmented NarrativeText elements.
Published by MthwRobinson over 1 year ago
partition_text to group together broken paragraphs.
partition_rtf for processing rich text files.
partition now accepts a url kwarg in addition to file and filename.
replace_mime_encodings.
Published by MthwRobinson over 1 year ago
Published by ryannikolaidis over 1 year ago
--download-only parameter to unstructured-ingest
Published by amanda103 over 1 year ago
split_by_paragraph for partition_text
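As a rough illustration of what splitting text by paragraph can involve, the sketch below breaks a string on blank lines using only the standard library. The name split_paragraphs_sketch and the regular expression are assumptions made for this example; it is not the implementation of split_by_paragraph.

```python
import re
from typing import List

# Assumed paragraph boundary: a newline, optional whitespace, then another newline.
PARAGRAPH_BREAK = re.compile(r"\n\s*\n")


def split_paragraphs_sketch(text: str) -> List[str]:
    """Split text on blank lines and drop empty pieces (hypothetical helper)."""
    parts = PARAGRAPH_BREAK.split(text)
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = "First paragraph.\n\nSecond paragraph.\n \n\nThird paragraph."
    for paragraph in split_paragraphs_sketch(sample):
        print(paragraph)
```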
Published by MthwRobinson over 1 year ago
elements_to_json to return string when filename is not specified
elements_from_json may take a string instead of a filename with the text kwarg
detect_filetype now does a final fallback to file extension.
unstructured-ingest
--max-docs parameter to unstructured-ingest
partition_msg for processing MSFT Outlook .msg files.
convert_file_to_text now passes through the source_format and target_format kwargs.
text kwarg no longer raise an error if an empty
partition_json no longer fails if the input is an empty list.
chunk_by_attention_window that caused the last word in segments to be cut-off
stage_for_transformers now returns a list of elements, making it consistent with other
Published by amanda103 over 1 year ago
exactly_one
content_type and file_filename parameters to partition() to bypass file detection
--flatten-metadata parameter to unstructured-ingest
--fields-include parameter to unstructured-ingest
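For readers unfamiliar with metadata flattening, here is one possible scheme, sketched with the standard library only. The helper name flatten_dict_sketch, the underscore separator, and the sample record are all assumptions for illustration; the exact output shape produced by --flatten-metadata is not stated in these notes and may differ.

```python
from typing import Any, Dict


def flatten_dict_sketch(
    record: Dict[str, Any], parent_key: str = "", sep: str = "_"
) -> Dict[str, Any]:
    """Collapse nested dictionaries into one level (hypothetical helper).

    Nested keys are joined to their parent key with `sep`.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested dictionaries, carrying the joined key prefix.
            flat.update(flatten_dict_sketch(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat


if __name__ == "__main__":
    # Made-up record shaped like an element with a nested metadata block.
    element = {
        "type": "Title",
        "text": "Intro",
        "metadata": {"filename": "doc.pdf", "page_number": 1},
    }
    print(flatten_dict_sketch(element))
```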
Published by cragwolfe over 1 year ago
contains_english_word(), used heavily in text processing, is 10x faster.
--metadata-include and --metadata-exclude parameters to unstructured-ingest
clean_non_ascii_chars to remove non-ascii characters from unicode string
partition_pdf(..., strategy="fast")
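Removing non-ASCII characters from a Unicode string, as the clean_non_ascii_chars entry describes, can be sketched with the standard library alone. The function name remove_non_ascii_sketch is an assumption for this example, and the encode/decode round trip shown is one common approach, not necessarily how the library does it.

```python
def remove_non_ascii_sketch(text: str) -> str:
    """Drop every character that has no ASCII representation (hypothetical helper)."""
    # errors="ignore" silently discards characters that cannot be encoded as ASCII.
    return text.encode("ascii", errors="ignore").decode("ascii")


if __name__ == "__main__":
    # \u00e9 is an accented "e" and \u2022 is a bullet; both are removed.
    print(remove_non_ascii_sketch("caf\u00e9 \u2022 item"))
```

Note that characters are dropped rather than transliterated, so accented letters disappear instead of being mapped to their unaccented forms.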
Published by MthwRobinson over 1 year ago
FsspecConnector to easily integrate any existing fsspec filesystem as a connector.
s3_connector.py to s3.py for readability and consistency with the
S3Connector relies on s3fs instead of on boto3, and it inherits FsspecConnector.
UNSTRUCTURED_LANGUAGE_CHECKS environment variable to control whether or not language
"true" for higher
"false" for faster processing.
detect_filetype warning to include filename when provided.
unstructured-ingest --s3-url option, to be deprecated in
--remote-url.
AzureBlobStorageConnector based on its fsspec implementation inheriting FsspecConnector
partition_epub for partitioning e-books in EPUB3 format.
message/rfc822 MIME type.
Published by qued over 1 year ago
auto.partition() can now load Unstructured ISD json documents.
--wikipedia-auto-suggest argument to the ingest CLI to disable automatic redirection
encoding argument to the partition_(text/email/html) functions.
Published by MthwRobinson over 1 year ago
unstructured-ingest now uses a default --download_dir of $HOME/.cache/unstructured/ingest
setup_ubuntu.sh no longer fails in some contexts by interpreting DEBIAN_FRONTEND=noninteractive as a command
unstructured-ingest no longer re-downloads files when --preserve-downloads
Published by cragwolfe over 1 year ago
partition_html sometimes.
requires_dependencies decorator, including the error message
unstructured-ingest --github-url ...
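To show the general idea behind a dependency-checking decorator such as requires_dependencies, here is a minimal sketch using only the standard library. The name requires_dependencies_sketch, its signature, and the error message wording are assumptions for illustration and should not be read as the library's actual decorator.

```python
import importlib.util
from functools import wraps
from typing import Any, Callable, List


def requires_dependencies_sketch(dependencies: List[str]) -> Callable:
    """Raise ImportError at call time if any listed top-level module is missing.

    Hypothetical helper: only top-level module names are supported here.
    """

    def decorator(func: Callable) -> Callable:
        @wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            # find_spec returns None when a top-level module cannot be located.
            missing = [
                name for name in dependencies
                if importlib.util.find_spec(name) is None
            ]
            if missing:
                raise ImportError(
                    f"{func.__name__} requires missing dependencies: {', '.join(missing)}"
                )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@requires_dependencies_sketch(["json"])
def uses_json() -> str:
    # json ships with Python, so this call always passes the check.
    return "ok"
```

Because the check runs when the decorated function is called rather than at import time, a module can still be imported when an optional dependency is absent.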