unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

APACHE-2.0 License

Downloads: 1.6M
Stars: 5.8K
Committers: 110
unstructured - 0.3.2

Published by MthwRobinson almost 2 years ago

0.3.2

  • Added translate_text brick for translating text between languages
  • Add an apply method to make it easier to apply cleaners to elements
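
The idea behind an apply method for cleaners can be pictured with a small standalone sketch. The Element class and the two cleaner functions below are hypothetical stand-ins written for illustration, not the library's own code. The assumption is that every cleaner takes a string and returns a string.

```python
from typing import Callable


class Element:
    """Hypothetical stand-in for a text element (not the library's class)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def apply(self, *cleaners: Callable[[str], str]) -> None:
        """Run each cleaner over the text in order and store the result."""
        cleaned = self.text
        for cleaner in cleaners:
            cleaned = cleaner(cleaned)
            if not isinstance(cleaned, str):
                raise ValueError("A cleaner must return a string.")
        self.text = cleaned


def strip_bullet(text: str) -> str:
    # Illustrative cleaner: drop a leading bullet character.
    return text.lstrip("•").strip()


def collapse_spaces(text: str) -> str:
    # Illustrative cleaner: squeeze runs of whitespace into one space.
    return " ".join(text.split())


element = Element("•  An   excellent point!")
element.apply(strip_bullet, collapse_spaces)
print(element.text)
```

Passing the cleaners as positional arguments keeps their order explicit, which matters when one cleaner depends on the output of another.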
unstructured - 0.3.1

Published by MthwRobinson almost 2 years ago

0.3.1

  • Added __init__.py to partition
unstructured - 0.3.0

Published by MthwRobinson almost 2 years ago

0.3.0

  • Implement staging brick for Argilla. Converts lists of Text elements to argilla dataset classes.
  • Removed the local PDF parsing code along with its dependencies and tests.
  • Reorganizes the staging bricks in the unstructured.partition module
  • Allow entities to be passed into the Datasaur staging brick
  • Added HTML escapes to the replace_unicode_quotes brick
  • Fix partition_pdf to raise ValueError on bad responses
  • Adds partition_html for partitioning HTML documents.
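
As a rough illustration of what partitioning an HTML document means, the sketch below splits a page into (category, text) pairs using only the Python standard library. It is not how partition_html is implemented: the function name partition_html_sketch, the SimplePartitioner class, and the tag-to-category mapping are all assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

# Assumed mapping from HTML tags to element categories (illustration only).
TAG_CATEGORIES = {
    "h1": "Title",
    "h2": "Title",
    "h3": "Title",
    "p": "NarrativeText",
    "li": "ListItem",
}


class SimplePartitioner(HTMLParser):
    """Collects the text of each mapped tag as one element."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: List[Tuple[str, str]] = []
        self._current: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Start buffering when a mapped tag opens.
        if tag in TAG_CATEGORIES:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside inline tags such as <b> is kept with its parent.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.elements.append((TAG_CATEGORIES[tag], text))
            self._current = None


def partition_html_sketch(html: str) -> List[Tuple[str, str]]:
    parser = SimplePartitioner()
    parser.feed(html)
    parser.close()
    return parser.elements


doc = (
    "<html><body><h1>Release notes</h1>"
    "<p>Adds a <b>new</b> brick.</p>"
    "<ul><li>First item</li></ul></body></html>"
)
elements = partition_html_sketch(doc)
for category, text in elements:
    print(category, "|", text)
```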
unstructured - 0.2.4

Published by yuming-long almost 2 years ago

  • Add an alternative way of importing Final to support Google Colab
unstructured - 0.2.3

Published by MthwRobinson almost 2 years ago

0.2.3

  • Add cleaning bricks for removing prefixes and postfixes
  • Add cleaning bricks for extracting text before and after a pattern
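
The four cleaning operations listed above can be sketched with plain regular expressions. The function names remove_prefix, remove_postfix, text_before, and text_after are hypothetical and chosen for this example; the library's own bricks may differ in name, signature, and edge-case behavior.

```python
import re


def remove_prefix(text: str, pattern: str, ignore_case: bool = False) -> str:
    """Remove a regex match anchored at the start of the text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(rf"^(?:{pattern})", "", text, flags=flags).lstrip()


def remove_postfix(text: str, pattern: str, ignore_case: bool = False) -> str:
    """Remove a regex match anchored at the end of the text."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.sub(rf"(?:{pattern})$", "", text, flags=flags).rstrip()


def text_before(text: str, pattern: str) -> str:
    """Return everything that precedes the first match of the pattern."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"Pattern {pattern!r} not found in text.")
    return text[: match.start()].rstrip()


def text_after(text: str, pattern: str) -> str:
    """Return everything that follows the first match of the pattern."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"Pattern {pattern!r} not found in text.")
    return text[match.end():].lstrip()


print(remove_prefix("SUMMARY: A great result", r"SUMMARY:"))
print(remove_postfix("The end! END", r"END"))
print(text_before("Teacher: BLAH BLAH", r"BLAH"))
print(text_after("SPEAKER 1: Look at me", r"SPEAKER \d:"))
```

Raising ValueError when the pattern is missing is a design choice in this sketch; returning the input unchanged would be an equally reasonable convention.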
unstructured - 0.2.2

Published by MthwRobinson almost 2 years ago

0.2.2

  • Add staging brick for Datasaur
unstructured - 0.2.1

Published by MthwRobinson about 2 years ago

0.2.1

  • Added brick to convert an ISD dictionary to a list of elements
  • Update PDFDocument to use the from_file method
  • Added staging brick for converting ISD (Initial Structured Data) to CSV format.
  • Added staging brick for separating text into attention window size chunks for transformers.
  • Added staging brick for LabelBox.
  • Added ability to upload LabelStudio predictions
  • Added utility function for JSONL reading and writing
  • Added staging brick for CSV format for Prodigy
  • Added staging brick for Prodigy
  • Added ability to upload LabelStudio annotations
  • Added text_field and id_field to stage_for_label_studio signature
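
Splitting text into attention-window-sized chunks can be sketched as follows. This is a simplified illustration, not the staging brick itself: the function name chunk_by_window is hypothetical, and whitespace-separated words stand in for tokens. A real transformer tokenizer produces subword tokens and usually reserves room for special tokens, so actual chunk boundaries would differ.

```python
from typing import List


def chunk_by_window(text: str, window_size: int) -> List[str]:
    """Pack whitespace-separated words into chunks of at most window_size words."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    words = text.split()
    return [
        " ".join(words[start : start + window_size])
        for start in range(0, len(words), window_size)
    ]


chunks = chunk_by_window("one two three four five six seven", window_size=3)
print(chunks)
```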
Package Rankings
Top 1.48% on Pypi.org
Top 3.72% on Proxy.golang.org