Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
APACHE-2.0 License
Published by cragwolfe 11 months ago
Published by pravin-unstructured 11 months ago
- Use `pikepdf` to repair invalid PDF structure for PDFMiner when we see the error `PSSyntaxError` as PDFMiner opens the document and creates the pages object or processes a single PDF page.
- Batch Source Connector support: for instances where it is more optimal to read content from a source connector in batches, a new batch ingest doc is added which creates multiple ingest docs after reading them in batches per process.
- Handle `<style>` tags in HTML. `<style>` tags containing CSS in invalid positions previously contributed to element text. Do not consider the text node of a `<style>` element as textual content.
- Add `Header` and `Footer` document elements.

Published by yuming-long 11 months ago
- Add `PartitionStrategy` for the strategy constants and use the constants to replace strategy strings.
- Use `DEFAULT_PADDLE_LANG` before we have the language mapping for paddle.
- Metadata fields can be assigned directly, e.g. `element.metadata.coefficient = 0.58`. These fields will round-trip through JSON and can be accessed with dotted notation.
- `TYPE_TO_TEXT_ELEMENT_MAP`: updated the `Figure` mapping from `FigureCaption` to `Image`.
- Handle unexpected errors raised by `pdfminer` that caused `partition_pdf()` to fail. We expect to be able to partition smoothly using an alternative strategy if text extraction doesn't work. Added exception handling for unexpected errors when extracting PDF text and to help determine the PDF strategy.
- Fix the `fast` strategy falling back to `ocr_only`. The `fast` strategy should not fall back to a more expensive strategy.
- Include `languages` in metadata when partitioning with `strategy=hi_res` or `fast`. User-defined `languages` was previously used for text detection, but not included in the resulting element metadata for some strategies. `languages` will now be included in the metadata regardless of partition strategy for PDFs and images.
- Fix `KeyError: 'N'`. Certain PDFs were throwing this error when being opened by `pdfminer`. Added a wrapper function for `pdfminer` that allows these documents to be partitioned.
- Fix repeated `.text_as_html` on `Table` chunks. Remedies the repeated appearance of the full `.text_as_html` on the metadata of each `TableChunk` split from a `Table` element too large to fit in the chunking window.
- A table with only a `<thead>` or `<tfoot>` element was emitted as a table element having no text and unparseable HTML in `element.metadata.text_as_html`. Do not emit empty tables to the element stream.
- `element.metadata.text_as_html` contained spurious elements in invalid locations. The HTML generated for the `text_as_html` metadata for HTML tables contained `<br>` elements in invalid locations, like between `<table>` and `<tr>`. Change the HTML generator such that these do not appear.
- Table cells nested in a `<thead>` or `<tfoot>` element were not detected, and the text in those cells was omitted from the table element text and `.text_as_html`. Detect table rows regardless of the semantic tag they may be nested in.
- Remove padding and newlines from `.text_as_html`. `tabulate` inserts padding spaces to achieve visual alignment of columns in the HTML tables it generates. Add our own HTML generator to do this simple job and omit that padding as well as the newlines (`"\n"`) used for human readability.
- `output-dir/input-filename.json`
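The padding-free generator described in the last item can be sketched in a few lines (a minimal illustration with made-up names, not the library's actual helper):

```python
def rows_to_html(rows):
    # Join cells and rows directly: no alignment padding, no "\n" separators.
    cells = lambda row: "".join(f"<td>{c}</td>" for c in row)
    return "<table>" + "".join(f"<tr>{cells(r)}</tr>" for r in rows) + "</table>"
```

Unlike `tabulate`, the output contains only the cell text and the structural tags, so downstream consumers never see cosmetic whitespace.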
Published by cragwolfe 12 months ago
- Add a `check_connection()` method which makes sure a valid connection can be established with the source/destination, given any authentication credentials, in a lightweight request.
- `partition` now accepts a new optional parameter `request_timeout` which, if set, will prevent any `requests.get` call from hanging indefinitely and will instead raise a timeout error. This is useful when partitioning a `url` that may be slow to respond or may not respond at all.
- `_determine_pdf_auto_strategy` returned the `hi_res` strategy only if `infer_table_structure` was true. It now returns the `hi_res` strategy if either `infer_table_structure` or `extract_images_in_pdf` is true.
- Fix an error caused by a `0` value: a logical check is now added to avoid such an error.
- Fix an `ocr_only` bug: tables that contain only numbers are returned as floats in a `pandas.DataFrame` when the image is converted with `.image_to_data()`. An `AttributeError` was raised downstream when trying to `.strip()` the floats.
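The float problem above can be reproduced in isolation; a hedged sketch of the guard (the cell data is made up, not actual `image_to_data` output):

```python
# Numeric-only OCR table cells can come back as floats; float has no .strip(),
# so stripping the raw values raises AttributeError. Coerce to str first.
cells = [" 12.5 ", 42.0, " total "]  # mixed str/float, as OCR output can yield

cleaned = [str(cell).strip() for cell in cells]
```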
- Detect DOCX page breaks from `w:lastRenderedPageBreak` elements present in the document XML. Page breaks are NOT reliably indicated by "hard" page breaks inserted by the author and, when present, are redundant to a `w:lastRenderedPageBreak` element, so they cause over-counting if used. Use rendered page breaks only.

Published by yuming-long 12 months ago
- Add a `SourceConnectionNetworkError` custom error, which triggers the retry logic, if enabled, in the ingest pipeline.
- An `additional_partition_args` arg was added to allow users to pass in any other arguments that should be added when calling `partition()`. This helps keep any changes to the input parameters of `partition()` exposed in the CLI.
- Fix `partition_text` to prevent empty elements: adds a check to filter out empty bullets.
- Avoid defaulting `ocr_languages` when values for `languages` are provided. Some API users ran into an issue with sending `languages` params because the API defaulted to also using an empty string for `ocr_languages`. This update handles situations where `languages` is defined and `ocr_languages` is an empty string.
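The empty-string handling just described can be sketched with a hypothetical helper (the function name and fallback are illustrative, not the library's API):

```python
def resolve_languages(languages, ocr_languages=""):
    # An explicit (non-empty) legacy ocr_languages string like "eng+spa" wins;
    # an empty string no longer masks the languages list.
    if ocr_languages:
        return ocr_languages.split("+")
    return languages or ["eng"]  # assumed English fallback for illustration
```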
- Fix an error from `annots` that resolved out as `None`: a logical check was added to avoid such an error.
- Fix the `Estimating resolution as X` warning caused by invalid language parameter input. Proceed with the default language `eng` when `lang.py` fails to find a valid language code for Tesseract, so that we don't pass an empty string to the Tesseract CLI and raise an exception downstream.

Published by cragwolfe 12 months ago
- Use `yolox` by default for table extraction when partitioning pdf/image. The `yolox` model provides higher recall of table regions than the quantized version, and it is now the default element detection model when `infer_table_structure=True` for partitioning pdf/image files.
- With `hi_res`, some elements were extracted using pdfminer too, so we removed pdfminer from the tables pipeline to avoid duplicated elements.
- Allow `unstructured-ingest` to write to any of the following:
- Support the `ocr_only` strategy in `partition_pdf()`: adds the functionality to get accurate coordinate data when partitioning PDFs and Images with the `ocr_only` strategy.
- Add the `tables` extension when instantiating the `python-markdown` object. Importance: this will allow users to extract structured data from tables in markdown documents.
- The `get_uris_from_annots` function tried to access the dictionary value of a string instance variable. Assign `None` to the annotation variable if the instance type is not dictionary, to avoid the erroneous attempt.

Published by qued 12 months ago
- Remove `ebooklib` as a dependency. `ebooklib` is licensed under AGPL3, which is incompatible with the Apache 2.0 license; thus it is being removed.
- Add `re_download` to dictate if files should be forced to re-download rather than use what might already exist locally.

Published by qued 12 months ago
- Adds `overlapping_elements`, `overlapping_case`, `overlapping_percentage`, `largest_ngram_percentage`, `overlap_percentage_total`, `max_area`, `min_area`, and `total_area`.
- Add `typing-extensions` as an explicit dependency. This package is an implicit dependency, but the module is being imported directly in `unstructured.documents.elements`, so the dependency should be explicit in case changes in other dependencies lead to `typing-extensions` being dropped as a dependency.
- Stop passing `extract_tables` to `unstructured-inference` since it is now supported in `unstructured` instead. Table extraction previously occurred in `unstructured-inference`, but that logic, except for the table model itself, is now a part of the `unstructured` library. Thus the parameter triggering table extraction is no longer passed to the `unstructured-inference` package. Also noted the table output regression for PDF files.
- The `skip_infer_table_types` variable used in `partition` was not being passed down to specific file partitioners. Now you can utilize the `skip_infer_table_types` list variable when calling `partition` to specify the filetypes for which you want to skip table extraction, or the `infer_table_structure` boolean variable on the file-specific partitioning function.

Published by shreyanid about 1 year ago
- Since `Click`-based CLI ingest commands are added dynamically from a number of configs, a check was incorporated to make sure there were no duplicate entries, to prevent new configs from overwriting already-added options.
- Add `OCR_AGENT` for OCRing the entire document.
- Add `calculate_edit_distance`. For an easy function call, it is now a wrapper around the original function that calls `edit_distance` and returns it as a "score".
- `unstructured.embed.bedrock` now provides a connector to use AWS Bedrock's `titan-embed-text` model to generate embeddings for elements. This feature requires a valid AWS Bedrock setup and an internet connection to run.
- We previously imported `PDFResourceManager` from `pdfminer.converter`, which was causing an error for some users. We changed to import from the actual location of `PDFResourceManager`, which is `pdfminer.pdfinterp`.
- Fix an error from `langdetect` when language detection was attempted on an empty string. Language detection is now skipped for empty strings.
- Fix a chunking bug when `regex_metadata` was used, where every element that contained a regex match would start a new chunk.
- Add an `__init__.py` file under the folder.
- Fix a bug where API usage of `partition_pdf` gets `model_name=None`. In API usage the `model_name` value is `None`, and the `cast` function in `partition_pdf` would return `None` and lead to an attribute error. Now we use the `str` function to explicitly convert the content to a string so it is guaranteed to have `starts_with` and other string functions as attributes.
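The coercion in the last item can be sketched in isolation (variable names are illustrative, not the library's actual code):

```python
model_name = None  # what API usage may hand to partition_pdf

# str() coercion guarantees string methods exist; None becomes "".
name = str(model_name or "")
is_chipper = name.startswith("chipper")  # no AttributeError on None
```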
- Handle HTML tables without a `tbody` tag. HTML tables may sometimes just contain headers without a body (`tbody` tag).
- `max_characters`.

Published by awalker4 about 1 year ago
- `OCR` elements with only spaces in the text have full-page width in the bounding box, which causes the `xycut` sorting to not work as expected. Now the logic that parses OCR results removes any elements with only spaces (more than one space).
- Add an `__init__.py` file under the folder.

Published by cragwolfe about 1 year ago
- Precision for `points` is limited to 1 decimal point if `coordinates["system"] == "PixelSpace"` (otherwise 2 decimal points). Precision for `detection_class_prob` is limited to 5 decimal points.
- Fix `under_non_alpha_ratio` dividing by zero. Although this function guarded against a specific cause of division by zero, there were edge cases slipping through, like strings with only whitespace. This update more generally prevents the function from performing a division by zero.

Published by awalker4 about 1 year ago
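The division-by-zero guard described above can be sketched as follows (a hypothetical reimplementation, not the library's exact code):

```python
def under_non_alpha_ratio(text: str, threshold: float = 0.5) -> bool:
    # Count only non-whitespace characters; bail out early when there are
    # none, which covers "" as well as whitespace-only strings.
    total = sum(1 for c in text if not c.isspace())
    if total == 0:
        return False
    alpha = sum(1 for c in text if c.isalpha())
    return (total - alpha) / total > threshold
```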
- Bump `unstructured-inference` to 0.7.3. The updated version of `unstructured-inference` supports a new version of the Chipper model, as well as a cleaner schema for its output classes. Support is included for new inference features such as hierarchy and ordering.
- A `--skip-infer-table-types` parameter was added to map to the `skip_infer_table_types` partition argument. This gives more granular control to unstructured-ingest users, allowing them to specify the file types for which we should attempt table extraction.
- Add `metadata.links`, `metadata.link_texts` and `metadata.link_urls` for elements that contain a hyperlink that points to an external resource. So-called "jump" links pointing to document-internal locations (such as those found in a table of contents "jumping" to a chapter or section) are excluded.
- Add `elements_to_text` as a staging helper function. In order to get a single clean text output from unstructured for metric calculations, automate the process of extracting text from elements using this function.
Adds permissions (RBAC) data ingestion functionality for the Sharepoint connector. Problem: role-based access control is an important component in many data storage systems, and users may need to pass permissions (RBAC) data to downstream systems when ingesting data. Feature: added permissions data ingestion functionality to the Sharepoint connector.
- Embeddings can now be fetched in the connector `run` method. This allows users to specify that connectors fetch embeddings without failure.
- Add `__init__.py` in order to make it discoverable.

Published by cragwolfe about 1 year ago
Published by ryannikolaidis about 1 year ago
- Detect the document language using the `langdetect` package. An additional param `detect_language_per_element` is also added for partitioners that return multiple elements. Defaults to `False`.
- Improve `xy-cut` sorting: update `shrink_bbox()` to keep the top left rather than the center.
- Add the `max_characters=<n>` argument to all element types in the `add_chunking_strategy` decorator. Previously this argument was only utilized when chunking Table elements; it now applies to all partitioned elements if the `add_chunking_strategy` decorator is utilized, further preparing the elements for downstream processing.
- Add `bag_of_words` and `percent_missing_text` functions in order to count the word frequencies in two input texts and calculate the percentage of text missing relative to the source document.
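The two metric helpers just mentioned can be sketched as follows (signatures and normalization are illustrative, not the library's exact API):

```python
from collections import Counter

def bag_of_words(text: str) -> dict:
    # Case-insensitive word frequencies, split on whitespace.
    return dict(Counter(text.lower().split()))

def percent_missing_text(output: str, source: str) -> float:
    # Percentage of source word occurrences absent from the output.
    src, out = bag_of_words(source), bag_of_words(output)
    total = sum(src.values())
    missing = sum(max(n - out.get(w, 0), 0) for w, n in src.items())
    return 100 * missing / total if total else 0.0
```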
- Add `edit_distance` calculation metrics. In order to benchmark the cleaned text extracted with unstructured, `edit_distance` (Levenshtein distance) is included.
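A self-contained sketch of the Levenshtein distance, with a normalized "score" wrapper of the kind described (the library's exact normalization may differ):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,           # deletion
                           cur[j - 1] + 1,        # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_distance_score(output: str, source: str) -> float:
    # Hypothetical wrapper: 1.0 means identical, 0.0 means fully divergent.
    return max(0.0, 1 - edit_distance(output, source) / max(len(source), 1))
```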
- The `metadata` module had several top-level imports that were only used in and applicable to code related to specific document types, while there were many general-purpose functions. As a result, general-purpose functions couldn't be used without unnecessary dependencies being installed. Fix: moved third-party dependency top-level imports to inside the functions in which they are used, and applied a decorator to check that the dependency is installed and emit a helpful error message if not.
- Problem: `Title` elements from `chipper` get `category_depth=None` even when `Headline` and/or `Subheadline` elements are present on the same page. Fix: all `Title` elements with `category_depth=None` should be set to a depth of 0 instead, iff there are `Headline` and/or `Subheadline` element types present. Importance: `Title` elements should be equivalent to HTML `H1` when nested headings are present; otherwise, `category_depth` metadata can be ambiguous within elements on a page.
- Change `xy-cut` ordering output to be more column-friendly. This results in the order of elements more closely reflecting natural reading order, which benefits downstream applications. While element ordering from `xy-cut` is usually mostly correct when ordering multi-column documents, sometimes elements from a RHS column will appear before elements in a LHS column. Fix: add swapped `xy-cut` ordering by sorting by X coordinate first and then Y coordinate.
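The "swapped" ordering in the last item amounts to a column-first sort key, which can be illustrated with bare coordinate tuples (hypothetical data, not library output):

```python
# (x, y) origins of four elements: two in a left column, two in a right one.
boxes = [(300, 10), (10, 20), (300, 40), (10, 50)]

# Sorting by x first, then y, keeps the whole LHS column ahead of the RHS.
column_first = sorted(boxes, key=lambda b: (b[0], b[1]))
```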
- Fix an error when a `GoToR` annotation, which refers to PDF resources outside of its own file, was detected, since no condition caught such a case. The code fixes the issue by initializing the URI before any condition check.

Published by cragwolfe about 1 year ago
- `.xlsx` file type at Element level.
- Bump `unstructured-inference` to 0.6.6. The updated version of `unstructured-inference` makes table extraction in `hi_res` mode configurable, to fine-tune table extraction performance; it also improves element detection by adding a deduplication post-processing step in the `hi_res` partitioning of pdfs and images.
- Add the `add_chunking_strategy` decorator to partition functions. In addition to combining elements under Title elements, users can now specify the `max_characters=<n>` argument to chunk Table elements into TableChunk elements with `text` and `text_as_html` of at most `<n>` characters. This means partitioned Table results are ready for use in downstream applications without any post-processing.
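The length-bounded table splitting just described can be sketched with plain strings (a hypothetical helper; the real TableChunk logic also carries HTML and metadata):

```python
def split_table_text(text: str, max_characters: int = 500) -> list[str]:
    # Slice the oversized table text into consecutive windows of at most
    # max_characters, mirroring how TableChunk bounds chunk size.
    return [text[i:i + max_characters] for i in range(0, len(text), max_characters)]
```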
- Change the default `hi_res` model for pdf/image partition to `yolox`. Partitioning pdf/image using the `hi_res` strategy now utilizes the `yolox_quantized` model instead of the `detectron2_onnx` model. This new default model has better recall for tables and produces more detailed categories for elements.
- `partition_xlsx` can now read subtable(s) within one .xlsx sheet, along with extracting other title and narrative texts. Importance: this enhances the power of .xlsx reading beyond one table per sheet, allowing users to capture more data tables from the file, if they exist.
- Fix `partition_pdf`: an attempt to get the bounding box from an element experienced a reference-before-assignment error when the first object is not text-extractable. Fix: switched to a flag when the condition is met. Importance: crucial to be able to partition PDFs.
- Fix chunking when `detection_class_prob` appears in Element metadata. Problem: when `detection_class_prob` appears in Element metadata, Elements will only be combined by `chunk_by_title` if they have the same `detection_class_prob` value (which is rare). This is unlikely to be a case we ever need to support and most often results in no chunking. Fix: `detection_class_prob` is included in the chunking list of metadata keys excluded from the similarity comparison. Importance: this change allows `chunk_by_title` to operate as intended for documents which include `detection_class_prob` metadata in their Elements.

Published by cragwolfe about 1 year ago
- Improve `xy-cut` sorting to preprocess bboxes, shrinking all bounding boxes by 90% along the x and y axes (still centered around the same center point), which allows projection lines to be drawn where not possible before if layout bboxes overlapped.
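The preprocessing above is a center-preserving shrink, sketched below (an illustrative reimplementation, not the library's code):

```python
def shrink_bbox(x1, y1, x2, y2, factor=0.9):
    # Shrink by `factor` (90%) per axis: the new box keeps 10% of each side
    # length and stays centered on the original center point.
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw = (x2 - x1) * (1 - factor) / 2
    hh = (y2 - y1) * (1 - factor) / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)
```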
- Refactor `partition_xml` to be faster and more memory-efficient when partitioning large XML files. The new behavior is to partition iteratively to prevent loading the entire XML tree into memory at once in most use cases.
- Allow `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to an Azure Cognitive Search index.
- Detect the document language using the `langdetect` package. Adds the document languages as ISO 639-3 codes to the element metadata. Implemented only for the `partition_text` function to start.
- Add `links` metadata in `partition_pdf` for the `fast` strategy. Problem: PDF files contain rich information and hyperlinks that Unstructured did not capture earlier. Feature: `partition_pdf` now can capture embedded links within the file along with their associated text and page number. Importance: providing depth in extracted elements gives users a better understanding and richer context of documents. This also enables users to map to other elements within the document if the hyperlink is referred to internally.
- When you `pip install unstructured[all-docs]`, it will now upgrade both unstructured and unstructured-inference. Importance: this will ensure that the inference library is always in sync with the unstructured library; otherwise users will be using outdated libraries, which will likely lead to unintended behavior.
- Fix an error raised in `__post_init__`. Fix: adds a try/catch when the IngestConnector runs `get_ingest_docs`, such that the error is logged but all processable documents->IngestDocs are still instantiated and returned. Importance: allows users to ingest SharePoint content even when some files with unsupported filetypes exist there.
- Run `make html` and install the library to suppress warnings.
- Fix a bug in `partition_via_api`: the hosted API may return an element schema that's newer than the current `unstructured`. In this case, metadata fields were added which did not exist in the local `ElementMetadata` dataclass, and `__init__()` threw an error. Fix: remove nonexistent fields before instantiating in `ElementMetadata.from_json()`. Importance: crucial to avoid breaking changes when adding fields.
- Fix a `None` error. Problem: getting the `jump_url` from a nonexistent Discord `channel` fails. Fix: the `jump_url` property is now retrieved within the same context as the messages from the channel. Importance: avoids cascading issues when the connector fails to fetch information about a Discord channel.
- Fix writing `deltalake` on Linux. Problem: occasionally on Linux, ingest can throw a `SIGABRT` when writing a `deltalake` table, even though the table was written correctly. Fix: put the writing function into a `Process` to ensure its execution to the fullest extent before returning to the main process. Importance: improves stability of connectors using `deltalake`.
Published by amanda103 about 1 year ago
- Accept the `languages` param in any Tesseract-supported langcode or any ISO 639 standard language code.

Published by cragwolfe about 1 year ago
- Element types were not all mapped to `unstructured` element categories, so the consumer of the library would see many `UncategorizedText` elements. This fixes the issue, improving the granularity of the element-category outputs for better downstream processing and chunking. The mapping update is:
  - `NarrativeText`
  - `NarrativeText`
  - `Title`
  - `NarrativeText`
  - `NarrativeText`
  - `Title` (with `category_depth=1`)
  - `Title` (with `category_depth=2`)
  - `NarrativeText`
- `partition_pdf` with the `fast` strategy previously broke down some numbered list item lines as separate elements. This enhancement leverages the x,y coordinates and bbox sizes to help decide whether the following chunk of text is a continuation of the immediately previous detected ListItem element or not, and to not detect it as its own non-ListItem element.
- `UncategorizedText` elements.
- `Table` elements are now properly extracted.
- Adds the `add_chunking_strategy` decorator to partition functions. Previously, users were responsible for their own chunking after partitioning elements, often required for downstream applications. Now, individual elements may be combined into right-sized chunks, where min and max character size may be specified, if `chunking_strategy=by_title`. Relevant elements are grouped together for better downstream results. This enables users to immediately use partitioned results effectively in downstream applications (e.g. RAG architecture apps) without any additional post-processing.
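The essence of by-title chunking can be sketched with plain tuples standing in for elements (a toy illustration of the grouping idea, not the library's implementation, which also enforces character limits):

```python
def chunk_by_title(elements):
    # Each Title starts a new chunk; everything else joins the current one.
    chunks = []
    for kind, text in elements:
        if kind == "Title" or not chunks:
            chunks.append([text])
        else:
            chunks[-1].append(text)
    return [" ".join(c) for c in chunks]
```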
- Adds `languages` as an input parameter and marks the `ocr_languages` kwarg for deprecation in pdf, image, and auto partitioning functions. Previously, language information was only being used for Tesseract OCR for image-based documents and was in a Tesseract-specific string format, but by refactoring into a list of standard language codes independent of Tesseract, the `unstructured` library will better support `languages` for other non-image pipelines and/or support for other OCR engines.
- Removes `UNSTRUCTURED_LANGUAGE` env var usage and replaces `language` with `languages` as an input parameter to unstructured-partition-text_type functions. The previous parameter/input setup was not user-friendly or scalable to the variety of elements being processed. By refactoring the inputted language information into a list of standard language codes, we can support future applications of the element language such as detection, metadata, and multi-language elements. Now, to skip English-specific checks, set the `languages` parameter to any non-English language(s).
- Adds `xlsx` and `xls` filetype extensions to the `skip_infer_table_types` default list in `partition`. By adding these file types to the input parameter, these files should not go through table extraction. Users can still specify if they would like to extract tables from these filetypes, but will have to set the `skip_infer_table_types` list to exclude the desired filetype extension. This avoids misrepresenting complex spreadsheets where there may be multiple sub-tables and other content.
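The skip-list behavior can be sketched as a simple filetype check (names and the default list shown are illustrative of the description above, not the library's internals):

```python
DEFAULT_SKIP = ["xlsx", "xls"]  # spreadsheets skip table inference by default

def should_infer_tables(filetype: str, skip=DEFAULT_SKIP) -> bool:
    # Table structure is inferred only for filetypes not on the skip list;
    # passing a custom `skip` list re-enables extraction for spreadsheets.
    return filetype.lower() not in skip
```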
- `unstructured`'s NLP internals.
- Use the `unstructured_pytesseract.run_and_get_multiple_output` function to reduce the number of calls to `tesseract` by half when partitioning pdf or image with `tesseract`.
- Allow `unstructured-ingest` to write partitioned data from over 20 data sources (so far) to a Delta Table.
- Previously `partition_html` would return HTML elements, but now we preserve the format from the input using the `source_format` argument in the partition call.
- Add `PaddleOCR` as an optional alternative to `Tesseract` for OCR in processing of PDF or Image files; it is installable via the `makefile` command `install-paddleocr`. For experimental purposes only.
- Changes affecting `metadata.text_as_html` in an element. These changes include:
  - `ENTIRE_PAGE_OCR` to specify using paddle or tesseract on entire-page OCR
  - `cells_to_html` doesn't handle cells spanning multiple rows properly (0.5.25)
  - `cv2` preprocessing step before the OCR step in the table transformer (0.5.24)
- Add `category_depth` with default value `None`.
- Add `parent_id` on the element's metadata.
- `add_pytesseract_bboxes_to_elements` no longer returns `nan` values. The function logic is now broken into the new methods `_get_element_box` and `convert_multiple_coordinates_to_new_system`.
- Support `model_name` in `partition_image`. Problem: `partition_pdf` allows for passing a `model_name` parameter. Given the similarity between the image and PDF pipelines, the expected behavior is that `partition_image` should support the same parameter, but `partition_image` was unintentionally not passing along its `kwargs`. This was corrected by adding the kwargs to the downstream call.

Published by rbiseck3 about 1 year ago
Published by cragwolfe about 1 year ago
- Update the `_detect_filetype_from_octet_stream()` function to use libmagic to infer the content type of a file when it is not a zip file.
- Add the `clean_ligatures` function to expand ligatures in text.
- Fix: `partition_html` breaks on `<br>` elements.
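Ligature expansion of the kind `clean_ligatures` performs can be sketched with a small replacement table (the mapping shown is illustrative, not the library's full table):

```python
# A few common Unicode ligature code points and their ASCII expansions.
LIGATURES = {"ﬁ": "fi", "ﬂ": "fl", "ﬀ": "ff", "æ": "ae"}

def clean_ligatures(text: str) -> str:
    for lig, expanded in LIGATURES.items():
        text = text.replace(lig, expanded)
    return text
```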