Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
APACHE-2.0 License
Published by Klaijan about 1 year ago
document_to_element_list
partition_html output.
convert_to_bytes
min_partition kwarg that combines elements below a specified threshold and modifies splitting of strings longer than max partition so words are not split.
convert_to_bytes
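As a rough illustration of the min_partition behavior noted above, the sketch below splits long text on word boundaries and then merges pieces that fall under a threshold. It is not the library's implementation: the helper names split_on_word_boundaries and combine_short_chunks are hypothetical, and measuring size in characters is an assumption made for the example.

```python
from typing import List


def split_on_word_boundaries(text: str, max_partition: int) -> List[str]:
    """Split text into chunks of at most max_partition characters.

    Words are never cut; a single word longer than max_partition is kept
    whole and therefore exceeds the limit (a simplification in this sketch).
    """
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_partition or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


def combine_short_chunks(chunks: List[str], min_partition: int) -> List[str]:
    """Merge chunks shorter than min_partition characters with their neighbours."""
    combined: List[str] = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}" if buffer else chunk
        if len(buffer) >= min_partition:
            combined.append(buffer)
            buffer = ""
    if buffer:
        # A short trailing piece is attached to the previous chunk, if any.
        if combined:
            combined[-1] = f"{combined[-1]} {buffer}"
        else:
            combined.append(buffer)
    return combined
```

In this sketch merging can push a chunk past the maximum size; a real implementation would need to decide which limit wins.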
--encoding directive to ingest
detect_filetype
image_metadata property of the PageLayout instance to get the page image info in the document_to_element_list
ocr_only strategy
ocr_only strategy
.txt, .text, and .tab to list of extensions to check if file has a text/plain MIME type.
partition_doc so it doesn't error with LibreOffice7.
requires_dependencies.
hi_res as the default strategy value for partition_via_api and partition_multiple_via_api
auto strategy detected scanned document as having extractable text and using fast strategy, resulting in no output.
Published by rbiseck3 over 1 year ago
Adjust encoding recognition threshold value in detect_file_encoding
Fix KeyError when isd_to_elements doesn't find a type
Fix _output_filename for local connector, allowing single files to be written correctly to the disk
Fix for cases where an invalid encoding is extracted from an email header.
coordinates attribute of the element's metadata.
Published by tabossert over 1 year ago
include_metadata kwarg to partition_doc, partition_docx, partition_email, partition_epub, partition_json, partition_msg, partition_odt, partition_org, partition_pdf, partition_ppt, partition_pptx, partition_rst, and partition_rtf
Published by cragwolfe over 1 year ago
hi_res PDF parsing strategy (from unstructured-inference bump to 0.5.4)
partition_email and partition_msg will now process attachments if process_attachments=True and attachment_partitioner=partition.
Published by MthwRobinson over 1 year ago
max_partition parameter to partition_text, partition_pdf, partition_email, partition_msg and partition_xml that sets a limit for the size of an individual document element. The default is 1500 for everything except partition_xml, which has None.
hi_res model for pdfs and images is selectable via environment variable.
------- as list items.
partition_html
Published by cragwolfe over 1 year ago
partition_xml.
partition_org for processed Org Mode documents.
Published by cragwolfe over 1 year ago
parse_email for partition_eml so that unstructured-api passes the smoke tests
partition_email now works if there is no message content
"fast" strategy for partition_pdf so that it's able to recursively
Published by MthwRobinson over 1 year ago
MIME encodings for eml files with one of the common encodings if a unicode error occurs
detect_file_encoding
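The fallback to common encodings after a unicode error, mentioned above for eml files, can be sketched roughly as follows. This is an illustration only and not the library's code: decode_with_fallback and the COMMON_ENCODINGS tuple are hypothetical names, and the candidate list and its order are assumptions.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical candidate list; the encodings the library actually tries may differ.
COMMON_ENCODINGS = ("utf-8", "cp1252", "latin-1")


def decode_with_fallback(
    data: bytes,
    encoding: Optional[str] = None,
    candidates: Sequence[str] = COMMON_ENCODINGS,
) -> Tuple[str, str]:
    """Decode bytes and return (text, encoding_used).

    If the caller names an encoding, it is used as-is and any
    UnicodeDecodeError propagates. Otherwise each candidate is tried
    in order until one decodes without error.
    """
    if encoding is not None:
        return data.decode(encoding), encoding
    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Because latin-1 maps every byte to a character, keeping it last in the list means the default candidates always produce some text, though not necessarily the intended text.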
eml files
html_assemble_articles kwarg to partition_html to enable users to capture <article> tags is captured when <article> tags are present.
xml attribute on element before looking for pagebreaks in partition_docx.
Published by yuming-long over 1 year ago
.docx and .doc when user or renderer
partition_docx to include headers and footers in the output.
partition_tsv and associated tests. Make additional changes to detect_filetype.
partition_via_api since we now require valid/empty api keys
None instead of 1 when page number is not present in the metadata. None indicates that page numbers are not being tracked for the document
None.
Published by cragwolfe over 1 year ago
partition_pdf for fast strategy
--fast strategy on PDF documents
partition_rst for processed reStructuredText documents.
Published by yuming-long over 1 year ago
partition_via_api and partition_multiple_via_api
detect_filetype and partition.
grpcio import issue on weaviate.schema.validate_schema for Python 3.9 and 3.10
detectron2 from source in Dockerfile
Published by yuming-long over 1 year ago
strategy parameter down from partition for partition_image
text/plain MIME type
convert_office_doc no longer prints file conversion info messages to stdout.
partition_via_api reflects the actual filetype for the file processed in the API.
Published by MthwRobinson over 1 year ago
elements_to_json and elements_from_json
read_txt_file utility function to keep using spooled_to_bytes_io_if_needed for xml
read_txt_file utility function to handle file-like object from URL
encoding from partition_pdf
None default for encoding
tabulate explicitly to dependencies
metadata.page_number of pptx files
Published by MthwRobinson over 1 year ago
stage_for_weaviate to stage unstructured outputs for upload to Weaviate, along with
Published by MthwRobinson over 1 year ago
detectron2 from source is no longer required when using the local-inference extra.
.pptx parsing to include text in tables.
_add_element_metadata that caused all elements to have page_number=1
.log as a file extension for TXT files.
(.eml) files if an error related to the encoding is raised and the user has not specified an encoding.
replace_mime_encodings
partition_html when include_metadata=False
ValueError now raises if file_filename is not specified when you use partition_via_api
Published by yuming-long over 1 year ago
Published by cragwolfe over 1 year ago
Published by qued over 1 year ago
Published by MthwRobinson over 1 year ago
partition_csv for CSV files.