argilla

Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets

Apache-2.0 License

Downloads
375.3K
Stars
3.7K
Committers
92

argilla - v1.2.2

Published by frascuchon over 1 year ago

1.2.2

Bug Fixes

  • Copying datasets between workspaces with proper owner/workspace info. Closes #2562
  • Copy dataset with empty workspace to the default user workspace 905d4de
  • Using elasticsearch config to request backend version. Closes #2311
argilla - v1.5.0

Published by frascuchon over 1 year ago

🔆 Highlights

Dataset Settings page

We have added a Settings page for your datasets. From there, you can manage your dataset: currently, it is possible to add labels to your labeling schema and to delete the dataset.

Add images to your records

The image in this record was generated using https://robohash.org

You can pass a URL in the metadata field _image_url and the image will be rendered in the Argilla UI. You can use this in the Text Classification and the Token Classification tasks.

Non-searchable metadata fields

Apart from the _image_url field, you can also pass other metadata fields that won't be used in queries or filters by adding an underscore at the start, e.g. _my_field.
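
A minimal sketch combining both features (an _image_url rendered in the UI plus a non-searchable field); the dataset name, URL, and values are illustrative:

import argilla as rg

record = rg.TextClassificationRecord(
    text="A text with an attached robot avatar",
    metadata={
        "_image_url": "https://robohash.org/argilla",  # rendered in the UI
        "_my_field": "hidden from queries and filters",  # leading underscore: non-searchable
        "source": "regular metadata, searchable as usual",
    },
)

rg.log(record, name="my-dataset")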

Load only what you need using rg.load

You can now specify the fields you want to load from your Argilla dataset. That way, you can avoid loading heavy vectors when you don't need them for your annotations.
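
For instance, a sketch like the following should skip the heavy vector payload (assuming the argument is named fields, as in the changelog entry below, and that the listed fields exist in the dataset):

import argilla as rg

# Retrieve only the listed fields; vectors are left out of the response
dataset = rg.load(name="my-dataset", fields=["text", "annotation"])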

Two new tutorials (kudos @embonhomme & @burtenshaw)

Check out our new tutorials created by the community!

  • Compare the performance of two text classification models here
  • Multimodal bulk annotation here

Changelog

All notable changes to this project will be documented in this file. See standard-version for commit guidelines.

1.5.0 - 2023-03-21

Added

  • Add the fields to retrieve when loading the data from argilla. rg.load takes too long because of the vector field, even when users don't need it. Closes #2398
  • Add new page and components for dataset settings. Closes #2442
  • Add ability to show image in records (for TokenClassification and TextClassification) if a URL is passed in metadata with the key _image_url
  • Non-searchable fields support in metadata. #2570

Changed

  • Labels are now centralized in a specific vuex ORM called GlobalLabel Model, see https://github.com/argilla-io/argilla/issues/2210. This model is the same for TokenClassification and TextClassification (so both tasks have labels with color_id and shortcuts parameters in the vuex ORM)
  • The shortcuts improvement for labels #2339 has been moved to the vuex ORM in the dataset settings feature #2444
  • Update "Define a labeling schema" section in docs.
  • The record inputs are sorted alphabetically in UI by default. #2581

Fixes

  • Allow URL to be clickable in Jupyter notebook again. Closes #2527

Removed

  • Removed some deprecated data scan endpoints used by old clients. This change breaks compatibility with clients <v1.3.0
  • Stopped using the old deprecated scan endpoints in the Python client. This breaks client compatibility with server versions <1.3.0
  • Removed the previous way to add labels through the dataset page. Labels can now be added only through the dataset settings page.

As always, thanks to our amazing contributors!

  • Documentation update: tutorial for text classification models comparison (#2426) by @embonhomme
  • Docs: fix little typo (#2522) by @anakin87
  • Docs: Tutorial on image classification (#2420) by @burtenshaw
argilla - v1.4.0

Published by frascuchon over 1 year ago

🔆 Highlights

Enhanced annotation flow for all tasks

Improved bulk annotation and actions

A more stylish banner for available global actions. It includes an improved label selector to apply and remove labels in bulk.

We enhanced multi-label text classification annotation: adding labels in bulk no longer removes previous labels. This action will change the status of the records to Pending and you will need to validate the annotations to save the changes.

Learn more about bulk annotations and multi-label text classification annotations in our docs.

Clear and Reset actions

New actions to clear all annotations and reset changes. They can be used at the record level or as bulk actions.

Unvalidate and undiscard

Click the Validate or Discard button on an already validated or discarded record to undo the action.

Optimized one-record view

Improved view for a single record to enable a more focused annotation experience.

Prepare for training for SparkNLP Text2Text

Extended support to prepare Text2Text datasets for training with SparkNLP.

Learn more in our docs.
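
A hedged sketch of what this looks like from the Python client; the dataset name is illustrative and the exact return type of the sparknlp output may vary by version:

import argilla as rg

dataset = rg.load(name="my-text2text-dataset")

# Returns the records formatted for training with SparkNLP
training_data = dataset.prepare_for_training(framework="sparknlp")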

Extended shortcuts for token classification (kudos @cceyda)

In token classification tasks that have 10+ options, labels get assigned QWERTY keys as shortcuts.

Changelog

All notable changes to this project will be documented in this file. See standard-version for commit guidelines.

1.4.0 (2023-03-09)

As always, thanks to our amazing contributors!

  • Documentation update: adding missing n (#2362) by @Gnonpi
  • feat: Extend shortcuts to include alphabet for token classification (#2339) by @cceyda
argilla - v1.3.1

Published by frascuchon over 1 year ago

1.3.1 (2023-02-24)

argilla - v1.3.0

Published by frascuchon over 1 year ago

🔆 Highlights

Keyword metric from Python client

The most important keywords in the dataset, or in a subset of it (using the query param), can be retrieved from Python. This can be useful for EDA and for defining programmatic labeling rules:

from argilla.metrics.commons import keywords
summary = keywords(name="example-dataset")
summary.visualize() # will plot a histogram with the results
summary.data # returns the raw result data

Prepare for training for SparkNLP and spaCy text-cat

Added a new framework, sparknlp, and extended the support for spacy to include text classification datasets. Check out this section of the docs.

Create train and test split with prepare_for_training

You can pass train_size and test_size to prepare_for_training to get train-test splits. This is especially useful for spaCy. Check out this section of the docs
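
A sketch of how this could look for spaCy, assuming that passing train_size and test_size makes prepare_for_training return a train/test pair of DocBin objects:

import spacy
import argilla as rg

dataset = rg.load(name="my-textcat-dataset")

# Split into train and test sets ready for `spacy train`
train_db, test_db = dataset.prepare_for_training(
    framework="spacy",
    lang=spacy.blank("en"),
    train_size=0.8,
    test_size=0.2,
)
train_db.to_disk("train.spacy")
test_db.to_disk("dev.spacy")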

Better repr for Dataset and Rule (kudos @Ankush-Chander)

When using the Python client, you now get a human-readable representation of Dataset and Rule entities.

Changelog

All notable changes to this project will be documented in this file. See standard-version for commit guidelines.

1.3.0 (2023-02-09)

Bug Fixes

  • Client: formatting caused offset in prediction (#2241) (d65db5a)
  • Client: Log remaining data when shutting down the dataset consumer (#2269) (d78963e), closes #2189
  • validate predictions fails on text2text (#2271) (f68856e), closes #2252

As always, thanks to our amazing contributors!

  • add repr method for Rule, Dataset. (#2148) by @Ankush-Chander
  • opensearch docker compose file doesn't run (#2228) by @kayvane1
  • Docs: fix typo in documentation (#2296) by @anakin87
argilla - v1.2.1

Published by frascuchon over 1 year ago

1.2.1 (2023-01-23)

argilla - v1.2.0

Published by frascuchon almost 2 years ago

1.2.0 (2023-01-12)

🔆 Highlights

Data labelling and curation with similarity search

Since 1.2.0, Argilla supports adding vectors to records, which can then be used to find the most similar records to a given one. This feature uses vector or semantic search combined with more traditional search (keyword and filter based).
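
A minimal sketch of attaching vectors when logging records; the vector name and values are illustrative, and real vectors would come from an embedding model:

import argilla as rg

record = rg.TextClassificationRecord(
    text="I love Paris",
    # Truncated toy vector; use the full output of your embedding model
    vectors={"sentence-embedding": [0.12, -0.34, 0.56]},
)
rg.log(record, name="my-dataset")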

View record info

You can now find all record details and fields, which can be useful for bookmarking, copy/pasting, and building ES queries.

View record timestamp

You can now see the timestamp associated with the record (event_timestamp), which corresponds to the moment when the record was uploaded, or to a custom timestamp passed when logging the data (e.g., the moment a prediction was made when using Argilla for monitoring).

Configure the base path of your Argilla UI (useful for proxies)

See: https://docs.argilla.io/en/latest/getting_started/installation/server_configuration.html#using-a-proxy

As always, thanks to our amazing contributors!

  • Add Azure deployment tutorial (#2124) by @burtenshaw
  • Create training-textclassification-activelearning-with-GPU.ipynb (#2020) by @MoritzLaurer
argilla - v1.1.1

Published by frascuchon almost 2 years ago

1.1.1 (2022-11-29)

argilla - v1.1.0

Published by frascuchon almost 2 years ago

1.1.0 (2022-11-24)

Highlights

Add, update, and delete rules from a Dataset using the Python client

You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.

import pandas as pd

from argilla.labeling.text_classification import Rule, add_rules

# Read a file with keywords or phrases
labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

# Create rules
predefined_labeling_rules = []
for index, row in labeling_rules_df.iterrows():
    predefined_labeling_rules.append(
        Rule(row["query"], row["label"])
    )

# Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)

You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels

Sort by timestamp fields in the UI

Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes

Features

  • #1929 add warning about using wrong hostnames (#1930) (a3bc554)
  • Add, delete and edit labeling rules from Python client (#1884) (d534a29), closes #1855
  • Added more explicit error message regarding dataset name validation (#1933) (c25a225), closes #1931 #1918
  • Allow sort records by event_timestamp or last_updated fields (#1924) (1c08c36), closes #1835
  • Create a contextual help to support the user in the different dataset views (#1913) (8e3851e)
  • Enable metadata length field config by environment variable (#1923) (0ff2de7), closes #1761
  • Update error page (#1932) (caeb7d4), closes #1894
  • Using new top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834

As always, thanks to our amazing contributors!

  • docs: Link key features (#1805) (#1809) by @chschroeder
  • View Docs link in frontend header users.vue (#1915) by @bengsoon
  • fix: Change method for Doc creation by spacy.Language (#1891) by @jamnicki
argilla - v1.0.1

Published by frascuchon almost 2 years ago

1.0.1 (2022-11-04)

Documentation

  • corrected for tutorial and api redirections (#1820) (26ccdcc)
argilla - v0.19.0

Published by frascuchon almost 2 years ago

argilla - v0.18.0

Published by frascuchon about 2 years ago

0.18.0 (2022-10-05)

⚡ Highlights

Better validation of token classification records

When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens.
Before this release, it was difficult to understand and fix these errors because validation happened on the server side.

With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.

For example, the following record:

import rubrix as rb

rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris"],
    text="I love Paris!",
    prediction=[("LOC",7,13)]
)

Will give you the following error message:

ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [Paris!] defined in ...love Paris!
Tokens:
['I', 'love', 'Paris']

Delete records by query

Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for cleanup and better dataset maintenance:

import rubrix as rb

## Delete by id
rb.delete_records(name="example-dataset", ids=[1,3,5])

## Discard records by query
rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)

New tutorials

We have two new tutorials!

Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html

Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html

Documentation

  • Add interpret tutorial with Transformers (#1728) (c3fa079), closes #1729
  • Adds tutorial about custom few-shot classification with SetFit (#1739) (4f15ee6), closes #1741
  • fixing the active learning tutorial with small-text (#1726) (909efdf), closes #1693
  • raise small-text version to 1.1.0 and adapt tutorial (#1744) (16f19b7), closes #1693
  • Resolve many typos in documentation, comments and tutorials (#1701) (f05e1c1)
  • using official token class. mapper since is compatible now (#1738) (e82fd13), closes #482

As always, thanks to our amazing contributors!

  • refactor: accept flat text as input for token classification mapper (#1686) by @Ankush-Chander
  • feat(Client): improve httpx errors handling (#1662) by @Ankush-Chander
  • fix: 'MajorityVoter.score' when using multi-labels (#1678) by @dcfidalgo
  • docs: raise small-text version to 1.1.0 and adapt tutorial (#1744) by @chschroeder
  • refactor: Incompatible attribute type fixed (#1675) by @luca-digrazia
  • docs: Resolve many typos in documentation, comments and tutorials (#1701) by @tomaarsen
  • refactor: Collection of changes, primarily regarding test suite and its coverage (#1702) by @tomaarsen
argilla - v0.17.0

Published by frascuchon about 2 years ago

0.17.0 (2022-08-22)

⚡ Highlights

Preparing a training set in the spaCy DocBin format

prepare_for_training is a method that prepares a dataset for training. Before this release, it prepared the data for easily training Hugging Face Transformers.

Now, you can prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct!

With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.

import spacy
import rubrix as rb

# Load the annotated dataset from Rubrix
rb_dataset = rb.load("ner_dataset")

# Load a blank spaCy language model to create the DocBin, as it works faster
nlp = spacy.blank("en")

# After this line, the file will be stored on disk
rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")

You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin

Load large datasets using batches

Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.

Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, a simple batch iteration over the whole dataset can be done using the id_from parameter of the rb.load method.

An example of reading the first 1000 records and the next batch of up to 1000 records:

import rubrix as rb
dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
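
And a sketch of iterating over the whole dataset this way, where process is a hypothetical stand-in for your own logic:

import rubrix as rb

batch = rb.load(name="example-dataset", limit=1000)
while len(batch) > 0:
    process(batch)  # placeholder for your own processing
    batch = rb.load(name="example-dataset", limit=1000, id_from=batch[-1].id)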

The reference to the rb.load method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load

Larger pagination sizes for faster bulk review and annotation

Using filters and search for data annotation and review, some users are able to filter and quickly review dozens of records in one go. To serve those users, it's now possible to see and bulk-annotate 50 or 100 records per page.

Copy record text to clipboard

Sometimes it is useful to copy the text of a record to inspect it or process it with another application. Now this is possible, thanks to the feature request by our great community member and contributor @Ankush-Chander!

Better error logging for generic errors

Thanks to work done by @Ankush-Chander and @frascuchon, we now have more meaningful messages for generic server errors!

Features

  • Add new pagination size ranges (#1667) (5b4f1f2), closes #1578
  • Allow rb.load fetch records in batches passing the from_id argument (3e6344a)
  • Copy to clipboard the record text (#1625) (d634a7b), closes #1616
  • Error Logging: send error detail in response for generic server errors (#1648) (ad17631)
  • Listeners: allow using query params in the condition through search parameter (#1627) (a0a245d), closes #1622
  • prepare_for_training supports spacy (#1635) (8587808)

You can see all work included in the release here

  • fix: Update progress bar when refreshing after adding new records (#1666) by @leiyre
  • chore: configure miniconda for readthedocs builder by @frascuchon
  • style: Small visual adjustments for Text2Text record card (#1632) by @leiyre
  • feat: Copy to clipboard the record text (#1625) by @leiyre
  • docs: Add Slack support link in README's get started (#1688) by @dvsrepo
  • chore: update version by @frascuchon
  • feat: Add new pagination size ranges (#1667) by @leiyre
  • fix: handle stream api connection errors gracefully (#1636) by @Ankush-Chander
  • feat: allow rb.load fetch records in batches passing the from_id argument by @maxserras
  • fix(Client): reusing the inner httpx client (#1640) by @frascuchon
  • feat(Error Logging): send error detail in response for generic server errors (#1648) by @frascuchon
  • docs: spacy DocBin cookbook (#1642) by @ignacioct
  • feat: prepare_for_training supports spacy (#1635) by @frascuchon
  • style: Improve card spacing (#1638) by @leiyre
  • docs: Adding Elasticsearch persistence to docker compose section (#1643) by @maxserras
  • chore: remove old rubrix client class (#1639) by @frascuchon
  • feat(Listeners): allow using query params in the condition through search parameter (#1627) by @frascuchon
  • doc: show metric graphs in documentation (#1669) by @leiyre
  • fix(docker-compose.yaml): default volume and disable disk threshold (#1656) by @frascuchon
  • fix: Encode rule name in Weak Labeling API requests (#1649) by @leiyre
argilla - v0.16.1

Published by frascuchon over 2 years ago

0.16.1 (2022-07-22)

Bug Fixes

  • 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) (3cb4c07), closes #1631
  • Display metadata in Text2Text dataset (#1626) (0089e0a), closes #1623
  • Show predicted OK/KO when predictions exist (#1620) (ef66e9c), closes #1619

Documentation

You can see all work included in the release here

  • fix: 'WeakMultiLabels.summary' and 'show_records' after extending the weak label matrix (#1633) by @dcfidalgo
  • fix: Display metadata in Text2Text dataset (#1626) by @leiyre
  • chore: set version by @dcfidalgo
  • docs: Fix typo in Getting Started -> Concepts (#1618) by @dcfidalgo
  • fix: Show predicted OK/KO when predictions exist (#1620) by @leiyre
argilla - v0.16.0

Published by frascuchon over 2 years ago

0.16.0 (2022-07-08)

Highlights

👂 Listeners: enable more interactive workflows between client and server

Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.

You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners

We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:

import numpy as np

import rubrix as rb
from rubrix.listeners import listener
from sklearn.metrics import accuracy_score

# Define some helper variables (trec, active_learner, dataset_test,
# DATASET_NAME and NUM_SAMPLES are defined in earlier steps of the tutorial)
LABEL2INT = trec["train"].features["label-coarse"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total==NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0
)
def active_learning_loop(records, ctx):

    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    ctx.query_params["batch_id"] += 1
    new_records = [
        rb.TextClassificationRecord(
            text=trec["train"]["text"][idx],
            metadata={"batch_id": ctx.query_params["batch_id"]},
            id=idx,
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Rubrix
    rb.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )
    ACCURACIES.append(accuracy)
    print("Done!")

    print("Waiting for annotations ...")

📖 New docs!

https://rubrix.readthedocs.io/

🧱 extend_matrix: Weak label augmentation using embeddings

This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
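
A hedged sketch following the tutorial's pattern; the threshold and the random embeddings below are illustrative stand-ins for tuned values and real sentence embeddings:

import numpy as np

from rubrix.labeling.text_classification import WeakLabels

weak_labels = WeakLabels(dataset="my-dataset")

# One embedding per record, aligned with weak_labels.records(); in practice
# these come from a model such as sentence-transformers
embeddings = np.random.rand(len(weak_labels.records()), 384)

# Records without rule matches inherit labels from similar records when the
# similarity passes the per-rule threshold
weak_labels.extend_matrix([0.8] * len(weak_labels.rules), embeddings)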

Documentation

  • #1512: change theme to furo (#1564, #1604) (98869d2), closes #1512
  • add 'how to prepare your data for training' to basics (#1589) (a21bcf3)
  • add active learning with small text and listener tutorial (#1585, #1609) (d59573f), closes #1601 #421
  • Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) (ab481c7)
  • add pip version and dockertag as parameter in the build process (#1560) (73a31e2)

You can see all work included in the release here

  • chore(docs): remove by @frascuchon
  • docs: add active learning with small text and listener tutorial (#1585, #1609) by @dcfidalgo
  • docs(#1512): change theme to furo (#1564, #1604) by @frascuchon
  • chore: set version by @frascuchon
  • feat(token-class): adjust token spans spaces (#1599) by @frascuchon
  • feat(#1602): new rubrix dataset listeners (#1507, #1586, #1583, #1596) by @frascuchon
  • docs: add 'how to prepare your data for training' to basics (#1589) by @dcfidalgo
  • test: configure numpy to disable multi threading (#1593) by @frascuchon
  • docs: Add MajorityVoter to references + Add comments about multi-label support of the label models (#1582) by @dcfidalgo
  • feat(#1561): standardize icons (#1565) by @leiyre
  • Feat: Improve from datasets (#1567) by @dcfidalgo
  • feat: Add 'extend_matrix' to the WeakMultiLabel class (#1577) by @dcfidalgo
  • docs: add pip version and dockertag as parameter in the build process (#1560) by @frascuchon
  • refactor: remove words references in searches (#1571) by @frascuchon
  • ci: check conda env cache (#1570) by @frascuchon
  • fix(#1264): discard first space after a token (#1591) by @frascuchon
  • ci(package): regenerate view snapshot (#1600) by @frascuchon
  • fix(#1574): search highlighting for a single dot (#1592) by @leiyre
  • fix(#1575): show predicted ok/ko in Text Classifier explore mode (#1576) by @leiyre
  • fix(#1548): access datasets for superusers when workspace is not provided (#1572, #1608) by @frascuchon
  • fix(#1551): don't show error traces for EntityNotFoundError's (#1569) by @frascuchon
  • fix: compatibility with new dataset version (#1566) by @dcfidalgo
  • fix(#1557): allow text editing when clicking the "edit" button (#1558) by @leiyre
  • fix(#1545): highlight words with accents (#1550) by @leiyre
argilla - v0.15.0

Published by frascuchon over 2 years ago

0.15.0 (2022-06-08)

🔆 Highlights

🏷️ Configure datasets with a labeling scheme

You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.

import rubrix as rb

# Define labeling schema
settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])

# Apply settings to a new or already existing dataset
rb.configure_dataset(name="my_dataset", settings=settings)

# Logging to the newly created dataset triggers the validation checks
rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
#BadRequestApiError: Rubrix server returned an error with http status: 400

Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html

🧱 Weak label matrix augmentation using embeddings

You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.

Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html

🏛️ Tutorial Gallery

Tutorials are now organized into different categories and with a new gallery design!

Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html

🏁 Basics guide

This is the first version of the basics guide. This guide shows you how to perform the most basic actions with Rubrix, such as uploading data or annotating records.

Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html

Features

  • #1134: Allow extending the weak label matrix with embeddings (#1487) (4d54994), closes #1134
  • #1432: configure datasets with a label schema (21e48c0), closes #1432
  • #1446: copy icon position in datasets list (#1448) (7c9fa52), closes #1446
  • #1460: include text hyphenation (#1469) (ec23b2d), closes #1460
  • #1463: change icon position in table header (#1473) (5172324), closes #1463
  • #1467: include animation delay for last progress bar track (#1462) (c772b74), closes #1467
  • configuraton: add elasticsearch ca_cert path variable (#1502) (f0eda12)
  • UI: improve access to actions in metadata and sort dropdowns (#1510) (8d33090), closes #1435

Bug Fixes

  • #1522: dates metadata fields accessible for sorting (#1529) (a576ceb), closes #1522
  • #1527: check agents instead labels for predicted computation (#1528) (2f2ee2e), closes #1527
  • #1532: correct domain for filter score histogram (#1540) (7478d6c), closes #1532
  • #1533: restrict highlighted fields (3a8b8a9), closes #1533
  • #1534: fix progress in the metrics sidebar when page is refreshed (#1536) (1b572c4)
  • #1539: checkbox behavior with value 0 (#1541) (7a0ab63), closes #1539
  • metrics: compute f1 for text classification (#1530) (147d38a)
  • search: highlight only textual input fields (8b83a82), closes #1538 #1544

New contributors

@RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413

argilla - v0.14.2

Published by frascuchon over 2 years ago

0.14.2 (2022-05-31)

Bug Fixes

  • #1514: allow ent score None and change default value to 0.0 (#1521) (0a02c70), closes #1514
  • #1516: restore read-only to copied dataset (#1520) (5b9cf0e), closes #1516
  • #1517: stop background task when something happens to main thread (#1519) (0304f40), closes #1517
  • #1518: disable global actions checkbox when no data was found (#1525) (bf35e72), closes #1518
  • UI: remove selected metadata fields for sortable fields dropdown (#1513) (bb9482b)
argilla - v0.14.1

Published by frascuchon over 2 years ago

0.14.1 (2022-05-20)

Bug Fixes

  • #1447: change agent when validating records with annotation but default status (#1480) (126e6f4), closes #1447
  • #1472: hide scrollbar in scrollable components (#1490) (b056e4e), closes #1472
  • #1483: close global actions "Annotate as" selector after deselect records checkbox (#1485) (a88f8cb)
  • #1503: Count filter values when loading a dataset with a route query (#1506) (43be9b8), closes #1503
  • documentation: fix user management guide (#1511) (63f7bee), closes #1501
  • filters: sort filter values by count (#1488) (0987167), closes #1484
argilla - 🎉 0.14.0

Published by frascuchon over 2 years ago

0.14.0 (2022-05-10)

Async version of rb.log

You can now use the parameter background in the rb.log method to log records without blocking the main process. The main use case is monitoring predictions in production pipelines. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):

from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import JsonInput
from bentoml.frameworks.spacy import SpacyModelArtifact

import rubrix as rb

import spacy

nlp = spacy.load("en_core_web_sm")


@env(infer_pip_packages=True)
@artifacts([SpacyModelArtifact("nlp")])
class SpacyNERService(BentoService):

    @api(input=JsonInput(), batch=True)
    def predict(self, parsed_json_list):
        result, rb_records = ([], [])
        for index, parsed_json in enumerate(parsed_json_list):
            doc = self.artifacts.nlp(parsed_json["text"])
            prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]
            rb_records.append(
                rb.TokenClassificationRecord(
                    text=doc.text,
                    tokens=[t.text for t in doc],
                    prediction=[
                        (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                    ],
                )
            )
            result.append(prediction)

        rb.log(
            name="monitor-for-spacy-ner",
            records=rb_records,
            tags={"framework": "bentoml"},
            background=True,
            verbose=False
        ) # With background=True, logging won't affect the model latency

        return result

Confidence scores in Token Classification (NER)

To store entity predictions you can attach a score using the last position of the entity tuple (label, char_start, char_end, score). Let's see an example:

import rubrix as rb

text = "Rubrix is a data science tool"

record = rb.TokenClassificationRecord(
    text=text, 
    tokens=text.split(" "), 
    prediction=[("PRODUCT",  0, 6, 0.99)]
)

rb.log(record, "ner_with_scores")

Then, in the web application, you and your team can use the score filter to find potentially problematic entities, like in the screenshot below:

If you want to see this in action, check this blog post by David Berenstein:

https://www.rubrix.ml/blog/concise-concepts-rubrix/

Rule metrics sidebar

We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.

This sidebar should help you quickly understand your progress:

See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html

argilla - v0.13.3

Published by frascuchon over 2 years ago

0.13.3 (2022-04-27)
