Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Apache-2.0 license
Published by christinestraub 6 months ago
- Remove page_number metadata fields for HTML partition until we have a better strategy to decide page counting.
- Fix cid characters in embedded text extracted by pdfminer.
- partition_docx() now handles short table rows. The DOCX format allows a table row to start late and/or end early, meaning cells at the beginning or end of a row can be omitted. While there are legitimate uses for this capability, it is relatively rare in practice; however, it can happen unintentionally when adjusting cell borders with the mouse. Accommodate this case and generate accurate .text and .metadata.text_as_html for these tables.

Published by cragwolfe 6 months ago
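The DOCX short-row case mentioned in these notes can be pictured with a small stdlib sketch. This is an illustrative helper, not unstructured's actual implementation: rows that "end early" are padded with empty cells so the generated HTML table stays rectangular and well-formed.

```python
# Illustrative sketch (not unstructured's actual code): building well-formed
# HTML for a table whose rows "start late" or "end early", i.e. some rows
# have fewer cells than the widest row.

def table_rows_to_html(rows: list[list[str]]) -> str:
    """Render rows of possibly unequal length as an HTML table.

    Short rows are padded with empty <td> cells so every row has the
    same number of columns as the widest row.
    """
    width = max(len(row) for row in rows)
    html_rows = []
    for row in rows:
        padded = row + [""] * (width - len(row))
        cells = "".join(f"<td>{cell}</td>" for cell in padded)
        html_rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(html_rows) + "</table>"

# A second row that "ends early" (only one of three cells present):
html = table_rows_to_html([["a", "b", "c"], ["d"]])
```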
- Fix: ListItem elements could end up reusing the same memory location, which then led to unexpected side effects when updating element IDs.

Published by plutasnyy 6 months ago
- Remote chunking with the basic and by_title strategies.
- UnstructuredTableTransformerModel is able to return the predicted table in cells format.
- Add the PDF_ANNOTATION_THRESHOLD environment variable to control the capture of embedded links in partition_pdf() for the fast strategy.

Published by scanny 6 months ago
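The PDF_ANNOTATION_THRESHOLD environment variable mentioned in these notes can be read with a pattern like the following sketch. The default value here is invented for the example and is not unstructured's actual default.

```python
import os

# Hypothetical illustration of reading an environment-variable threshold
# like PDF_ANNOTATION_THRESHOLD; 0.9 is an assumed default for this
# example only.
DEFAULT_THRESHOLD = 0.9

def annotation_threshold() -> float:
    # Fall back to the default when the variable is unset
    return float(os.environ.get("PDF_ANNOTATION_THRESHOLD", DEFAULT_THRESHOLD))

os.environ["PDF_ANNOTATION_THRESHOLD"] = "0.75"
threshold = annotation_threshold()
```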
- Add start_index support in html links extraction.
- Pass the strategy arg value to _PptxPartitionerOptions. This makes this partitioning option available for sub-partitioners to come that may optionally use inference or other expensive operations to improve the partitioning.
- Add a starting_page_number parameter to partitioning functions. It applies to those partitioners which support page_number in the element's metadata: PDF, TIFF, XLSX, DOC, DOCX, PPT, PPTX.
- unique_element_ids continues to be False by default, utilizing text hashes.
- <b> tags in HTML: partition_html() can now extract text from <b> tags inside container tags (like <div>, <pre>).

Published by cragwolfe 7 months ago
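Extracting text from <b> tags nested inside container tags, as described in these notes, can be illustrated with the stdlib HTML parser. unstructured's partition_html() does this with its own HTML handling, so this is only a sketch of the behavior:

```python
from html.parser import HTMLParser

# Stdlib sketch of pulling text from <b> tags nested inside container
# tags like <div> or <pre>. Illustrative only.
class BoldTextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._depth = 0       # how many open <b> tags we are inside
        self.bold_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "b" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # keep only non-empty text that appears inside a <b> tag
        if self._depth and data.strip():
            self.bold_texts.append(data.strip())

parser = BoldTextExtractor()
parser.feed("<div><pre>plain <b>bold text</b></pre></div>")
```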
- Fix partition failures in 0.13.1.

Published by ryannikolaidis 7 months ago
- Add ElementTypes to extend future element types.
- Fix partition_html() swallowing some paragraphs. partition_html() only considers elements with limited depth to avoid becoming the text representation of a giant div. This fix increases the limit value.

Published by ahmetmeleq 7 months ago
- Add .metadata.is_continuation to text-split chunks. .metadata.is_continuation=True is added to second-and-later chunks formed by text-splitting an oversized Table element, but not to their counterpart Text element splits. Add this indicator for CompositeElement to allow text-split continuation chunks to be identified by downstream processes that may wish to skip intentionally redundant metadata values in continuation chunks.
- Add a compound_structure_acc metric to table eval. Add a new property to unstructured.metrics.table_eval.TableEvaluation: composite_structure_acc, which is computed from the element-level row and column index and content accuracy scores.
- Add metadata.orig_elements to chunks. .metadata.orig_elements: list[Element] is added to chunks during the chunking process (when requested) to allow access to information from the elements each chunk was formed from. This is useful, for example, to recover metadata fields that cannot be consolidated to a single value for a chunk, like page_number, coordinates, and image_base64.
- Add an --include_orig_elements option to the Ingest CLI. By default, when chunking, the original elements used to form each chunk are added to chunk.metadata.orig_elements for each chunk. The include_orig_elements parameter allows the user to turn off this behavior to produce a smaller payload when they don't need this metadata.
- Original elements are preserved in .metadata.orig_elements for each chunk. This behavior allows the text and metadata of the elements that were combined to make each chunk to be accessed. This can be important, for example, to recover metadata such as .coordinates that cannot be consolidated across elements and so is dropped from chunks. This option is controlled by the include_orig_elements parameter to partition_*() or to the chunking functions. It defaults to True, so original elements are preserved by default. This behavior is not yet supported via the REST APIs or SDKs but will be in a closely subsequent PR to other unstructured repositories. The original elements will also not serialize or deserialize yet; this will also be added in a closely subsequent PR.
- Fix clean_pdfminer_inner_elements() to remove only pdfminer (embedded) elements merged with inferred elements. Previously, some embedded elements were removed even if they were not merged with inferred elements. Now, only embedded elements that are already merged with inferred elements are removed.
- Update the skip_infer_table_types parameter and reflect these changes in documentation.

Published by ron-unstructured 8 months ago
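The is_continuation idea described in these notes can be sketched as follows: when an oversized element's text is split into multiple chunks, every chunk after the first is marked as a continuation. The function and dict shapes here are illustrative, not unstructured's API.

```python
# Sketch of the is_continuation indicator for text-split chunks.
# Names and structures are invented for illustration.

def split_with_continuation(text: str, max_chars: int) -> list[dict]:
    """Split text into chunks; mark second-and-later chunks as continuations."""
    chunks = []
    for start in range(0, len(text), max_chars):
        is_cont = start > 0
        chunks.append({
            "text": text[start:start + max_chars],
            # only continuation chunks carry the indicator
            "metadata": {"is_continuation": True} if is_cont else {},
        })
    return chunks

chunks = split_with_continuation("x" * 25, max_chars=10)
```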
- Configurable link-capture threshold in partition_pdf() for the fast strategy. Previously, a threshold value that affects the capture of embedded links was set to a fixed value by default. This allows users to specify the threshold value for better capturing.
- Refactor the add_chunking_strategy decorator to dispatch by name. Add a chunk() function to be used by the add_chunking_strategy decorator to dispatch the chunking call based on a chunking-strategy name (that can be dynamic at runtime). This decouples chunking dispatch from only those chunkers known at "compile" time and enables runtime registration of custom chunkers.
- Fix: .name is not a local file path. When partitioning a file using the file= argument, and file is a file-like object (e.g. io.BytesIO) having a .name attribute, and the value of file.name is not a valid path to a file present on the local filesystem, FileNotFoundError was raised. This prevented use of the file.name attribute for downstream purposes, for example to describe the source of a document retrieved from a network location via HTTP.
- Handle a pandoc version which does not support RTF files, plus instructions that will help resolve that issue.
- Add the install-pandoc Makefile recipe to relevant stages of the CI workflow, ensuring it is a version that supports RTF input files.
- Replace -1.0 with np.nan and correct the row filtering of file metrics based on that.

Published by christinestraub 8 months ago
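The dispatch-by-name chunking with runtime registration described in these notes can be sketched with a simple registry. The registry, decorator, and trivial chunker below are invented for illustration and are not the library's actual internals.

```python
from typing import Callable

# Illustrative registry mapping chunking-strategy names to chunker
# functions; registration can happen at runtime.
_CHUNKERS: dict[str, Callable[[list[str]], list[str]]] = {}

def register_chunker(name: str):
    """Decorator that registers a chunker under a strategy name."""
    def decorator(fn):
        _CHUNKERS[name] = fn
        return fn
    return decorator

def chunk(elements: list[str], chunking_strategy: str) -> list[str]:
    """Dispatch to a chunker by name, allowing names unknown at 'compile' time."""
    try:
        chunker = _CHUNKERS[chunking_strategy]
    except KeyError:
        raise ValueError(f"unknown chunking strategy: {chunking_strategy!r}")
    return chunker(elements)

@register_chunker("basic")
def chunk_basic(elements):
    # trivial "chunker" for the example: join everything into one chunk
    return [" ".join(elements)]

result = chunk(["Hello", "world"], chunking_strategy="basic")
```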
- Updates to partition_pdf with the fast strategy.
- Refactor the identify_overlapping_or_nesting_case and catch_overlapping_and_nested_bboxes functions.
- Update partition_via_api() to convert all list-type parameters to JSON-formatted strings before calling the unstructured client SDK. This will support image block extraction via partition_via_api().
- Add check_connection in the opensearch, databricks, postgres, and azure connectors.
- Fix partition_xlsx() dropping content. The algorithm for detecting "subtables" within a worksheet dropped table elements for certain patterns of populated cells, such as when a trailing single-cell row appeared in a contiguous block of populated cells.
- Documentation: Key Concepts page.
- Rename OpenAiEmbeddingConfig to OpenAIEmbeddingConfig.
- Fix: the @add_chunking_strategy decorator was missing from partition_json(), such that pre-partitioned documents serialized to JSON did not chunk when a chunking strategy was specified.

Published by ahmetmeleq 9 months ago
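Converting list-type parameters to JSON-formatted strings before an API call, as described for partition_via_api() in these notes, can be sketched like this. The helper name is invented for the example.

```python
import json

# Sketch of serializing list-valued parameters as JSON strings so they
# survive transport as plain form fields. Illustrative helper only.
def jsonify_list_params(params: dict) -> dict:
    return {
        key: json.dumps(value) if isinstance(value, list) else value
        for key, value in params.items()
    }

converted = jsonify_list_params({
    "strategy": "hi_res",
    "extract_image_block_types": ["Image", "Table"],
})
```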
- Apply new black formatting. The black library recently introduced a new major version with new formatting conventions. This change brings code in the unstructured repo into compliance with the new conventions.
- Support .p7s files: partition_email can now process .p7s files. The signature for the signed message is extracted and added to metadata.
- partition_email now falls back to another valid content type if it's available.
- Add an OCRAgent interface and specify it using the OCR_AGENT environment variable.
- Fix partition_pdf() not working when using the chipper model with file.
- Handle malformed languages and ocr_languages arguments. Users regularly receive errors on the API because they define ocr_languages or languages with additional quotation marks, brackets, and similar mistakes. This update handles common incorrect arguments and raises an appropriate warning.
- hi_res_model_name now relies on unstructured-inference. When no explicit hi_res_model_name is passed into partition or partition_pdf_or_image, the default model is picked by unstructured-inference's settings or the os env variable UNSTRUCTURED_HI_RES_MODEL_NAME; it now returns the same model name regardless of infer_table_structure's value. This function will be deprecated in the future, and the default model name will then simply rely on unstructured-inference and will not consider the os env.

Published by ahmetmeleq 9 months ago
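Cleaning up language arguments that arrive with stray quotation marks or brackets, as described in these notes, could look like the following sketch. This is purely illustrative and not unstructured's actual cleanup code.

```python
# Sketch of normalizing language arguments such as '["eng"]' or "'deu'"
# down to plain codes like "eng". Illustrative only.
def clean_language(value: str) -> str:
    # strip surrounding whitespace, then brackets, then quotes
    return value.strip().strip("[]").strip("'\"").strip()

cleaned = [clean_language(v) for v in ('["eng"]', " 'deu' ", "fra")]
```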
- Add unstructured version information.
- Support in unstructured-ingest for writing partitioned data to a Databricks Volumes storage service.
- Fix check_connection. There was an error when trying to ls the destination directory; it may not exist at the moment of connector creation. Now check_connection calls ls on the bucket root, and this method is called on initialize of the destination connector.
- Fix: setup.py is currently pointing to the wrong location for the databricks-volumes extra requirements. This results in errors when trying to build the wheel for unstructured. This change updates it to point to the correct path.

Published by ron-unstructured 9 months ago
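The check_connection fix described in these notes, listing the bucket root rather than a destination directory that may not exist yet, can be sketched with a stand-in filesystem object. Everything here is illustrative; it is not a real connector.

```python
# Sketch of the check_connection pattern: list the bucket root (which
# exists if the bucket does) at connector initialization, instead of a
# destination directory that may not have been created yet.
class DestinationConnector:
    def __init__(self, fs, bucket: str):
        self.fs = fs
        self.bucket = bucket

    def check_connection(self) -> None:
        # ls on the bucket root rather than a possibly-missing subdirectory
        self.fs.ls(self.bucket)

    def initialize(self) -> None:
        self.check_connection()

class FakeFS:
    """Stand-in for a remote filesystem client."""
    def __init__(self, existing):
        self.existing = set(existing)

    def ls(self, path):
        if path not in self.existing:
            raise FileNotFoundError(path)
        return []

connector = DestinationConnector(FakeFS({"my-bucket"}), "my-bucket")
connector.initialize()  # succeeds: bucket root exists even if subdirs don't
```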
- Bump the unstructured-inference version.

Published by ron-unstructured 9 months ago
- .bmp support in partition_image: adds support for .bmp files in partition, partition_image, and detect_filetype.
- Fix: Image elements with small chunks of text were ignored unless the image block extraction parameters (extract_images_in_pdf or extract_image_block_types) were specified. Now, all image elements are kept regardless of whether the image block extraction parameters are specified.
- Support .wav files: add filetype detection for .wav files.
- private-key-file has been renamed to private-key. The private key can be provided as a path to a file or as file contents.
- Documentation updates: deprecated Staging bricks in favor of Destination Connectors, (iii) added warning and code examples for using the SaaS API Endpoints via CLI vs. SDKs, (iv) fixed example pages' formatting, (v) added a deprecation of model_name in favor of hi_res_model_name, (vi) added extract_images_in_pdf usage in the partition_pdf section, (vii) reorganized and improved the documentation introduction section, and (viii) added PDF table extraction best practices.
- Add api_key_auth to UnstructuredClient.

Published by awalker4 10 months ago
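Filetype detection for formats like .wav, mentioned in these notes, is commonly done from magic bytes in the file header: WAV files begin with b"RIFF" and carry b"WAVE" at bytes 8-12. This is a generic sketch of that technique, not unstructured's detect_filetype code.

```python
# Generic magic-byte sniffing sketch for WAV detection (RIFF/WAVE header).
def is_wav(header: bytes) -> bool:
    # A WAV file starts with "RIFF", a 4-byte chunk size, then "WAVE".
    return len(header) >= 12 and header[:4] == b"RIFF" and header[8:12] == b"WAVE"

# Construct a minimal valid-looking WAV header for the example:
wav_header = b"RIFF" + (36).to_bytes(4, "little") + b"WAVEfmt "
detected = is_wav(wav_header)
```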
Published by cragwolfe 10 months ago
Published by awalker4 10 months ago
- Add an overlap kwarg on partition functions.
- image_base64 and image_mime_type (if that is what the user specifies by some other param like pdf_extract_to_payload). This would allow the API to have parity with the library.
- Fix partition_via_api: users that self-host the API were not able to pass their custom URL to partition_via_api.

Published by cragwolfe 10 months ago
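What an overlap between consecutive chunks means, in the sense of the overlap kwarg mentioned in these notes, can be sketched on a raw string: each chunk after the first repeats the last few characters of the previous one. Illustrative only; unstructured's chunkers operate on elements, not raw strings.

```python
# Sketch of overlapped chunking: consecutive chunks share `overlap`
# trailing/leading characters. Invented helper for illustration.
def chunk_with_overlap(text: str, max_chars: int, overlap: int) -> list[str]:
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    step = max_chars - overlap
    return [text[i:i + max_chars] for i in range(0, len(text) - overlap, step)]

chunks = chunk_with_overlap("abcdefghij", max_chars=4, overlap=2)
```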
- The updated annotation script supports annotating final, inferred, and extracted elements.
- staging_for bricks.
- text/plain and text/html.
- Support in unstructured-ingest for writing partitioned/embedded data to a Chroma vector database.

Published by cragwolfe 10 months ago
- Fix a partition_pdf() and partition_image() importation issue. Reorganize the pdf.py and image.py modules to be consistent with other types of document import code.

Published by cragwolfe 10 months ago
- Move code from unstructured-inference to unstructured.
- Add a sensitive annotation for fields related to auth (i.e. passwords, tokens). Refactor all fsspec connectors to use explicit access configs rather than a generic dictionary.
- Optionally save table images as table-<pageN>-<tableN>.jpg. This filename is presented in the image_path metadata field for the Table element. The default would be to not do this.
- Support in unstructured-ingest for writing partitioned data from over 20 data sources (so far) to a Weaviate object collection.
- Fix hi_res partitioning failure when pdfminer fails. Implemented logic to fall back to "inferred_layout + OCR" if pdfminer fails in the hi_res strategy.
- Limit what tesseract can handle for OCR layout detection.
- partition_csv now identifies the correct delimiter before the file is processed.
- Fix hi_res cid codes: occasionally pdfminer can fail to decode the text in a PDF file and return a cid code as text. Now when this happens, the text from OCR is used.
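Identifying a CSV delimiter before processing, as partition_csv is described as doing in these notes, is the kind of detection the stdlib csv.Sniffer performs. This sketch only illustrates the idea; unstructured may implement it differently.

```python
import csv

# Stdlib delimiter detection: sniff the dialect from a sample of the file.
sample = "name;age;city\nalice;30;berlin\nbob;25;paris\n"
dialect = csv.Sniffer().sniff(sample, delimiters=";,")
delimiter = dialect.delimiter
```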