Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
APACHE-2.0 License
Published by frascuchon over 1 year ago
We have added a Settings page for your datasets. From there, you will be able to manage your dataset. Currently, it is possible to add labels to your labeling schema and delete the dataset.
You can pass a URL in the metadata field _image_url and the image will be rendered in the Argilla UI. You can use this in the Text Classification and the Token Classification tasks. (The example image shown in the release was generated using https://robohash.org.) Apart from _image_url, you can also pass other metadata fields that won't be used in queries or filters by adding an underscore at the start, e.g. _my_field.
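As a minimal sketch of how this could look with the Python client (the dataset name, image URL, and extra metadata key below are placeholders, not part of the release):

import argilla as rg

# Metadata keys starting with an underscore are stored with the record but not used
# in queries or filters; _image_url is additionally rendered as an image in the UI.
record = rg.TextClassificationRecord(
    text="A record with an attached robot avatar",
    metadata={
        "_image_url": "https://robohash.org/argilla",
        "_my_field": "extra info shown with the record",
    },
)
rg.log(record, name="images-example")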
rg.load
You can now specify the fields you want to load from your Argilla dataset. That way, you can avoid loading heavy vectors if you don't need them for your annotations.
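For example, something along these lines (the parameter name and field names are assumptions based on this note; check the rg.load reference for the exact API):

import argilla as rg

# Load only the listed fields and skip heavy ones such as vectors
records = rg.load("my_dataset", fields=["text", "annotation"])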
Check out our new tutorials created by the community!
rg.load takes too long because of the vector field, even when users don't need it. Closes #2398
_image_url (<v1.3.0, <1.3.0)
Published by frascuchon over 1 year ago
A more stylish banner for available global actions. It includes an improved label selector to apply and remove labels in bulk.
We enhanced multi-label text classification annotations: adding labels in bulk no longer removes previous labels. This action will change the status of the records to Pending and you will need to validate the annotation to save the changes.
Learn more about bulk annotations and multi-label text classification annotations in our docs.
New actions to clear all annotations and reset changes. They can be used at the record level or as bulk actions.
Click the Validate or Discard buttons in a record to undo this action.
Improved view for a single record to enable a more focused annotation experience.
Extended support to prepare Text2Text datasets for training with SparkNLP.
Learn more in our docs.
In token classification tasks that have 10+ options, labels get assigned QWERTY keys as shortcuts.
configure_dataset accepts a workspace as argument (#2503) (29c9ee3)
Added active_client function to the main argilla module (#2387) (4e623d4), closes #2183
rg.log or rg.load (#2425) (b3b897a), closes #2059
Deprecated chunk_size in favor of batch_size for rg.log (#2455) (3ebea76), closes #2453
Added batch_size parameter for rg.load (#2460) (e25be3e), closes #2454 #2434
Published by frascuchon over 1 year ago
quickstart: change default api key for the argilla quickstart image (#2357) (bb14f3c)
Resolve errors found in prepare_for_training during autotrain integration (https://github.com/argilla-io/argilla/pull/2411)
Closes https://github.com/argilla-io/argilla/issues/2406
Closes https://github.com/argilla-io/argilla/issues/2407
Closes https://github.com/argilla-io/argilla/issues/2408
Closes https://github.com/argilla-io/argilla/issues/2405
Published by frascuchon over 1 year ago
Most important keywords in the dataset or a subset (using the query param) can be retrieved from Python. This can be useful for EDA and defining programmatic labeling rules:
from argilla.metrics.commons import keywords
summary = keywords(name="example-dataset")
summary.visualize() # will plot a histogram with the results
summary.data # returns the raw result data
Added a new framework sparknlp and extended the support for spacy, including text classification datasets. Check out this section of the docs.
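A rough sketch of what this enables (the dataset name is a placeholder and the "spark-nlp" framework string is an assumption; see the linked docs section for the exact values):

import argilla as rg
import spacy

# A Text Classification dataset previously logged to Argilla (placeholder name)
dataset = rg.load("my_textcat_dataset")

# spaCy: returns a DocBin with the annotations set as doc.cats
nlp = spacy.blank("en")
docbin = dataset.prepare_for_training(framework="spacy", lang=nlp)

# Spark NLP: assumed here to return a DataFrame ready for Spark NLP training
df = dataset.prepare_for_training(framework="spark-nlp")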
You can pass train_size and test_size to prepare_for_training to get train-test splits. This is especially useful for spaCy. Check out this section of the docs.
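A minimal sketch, assuming prepare_for_training returns a (train, test) pair when both sizes are passed (the dataset name is a placeholder):

import argilla as rg
import spacy

nlp = spacy.blank("en")
dataset = rg.load("my_ner_dataset")

# Split 80/20 and export both parts for the spacy train command
train_db, dev_db = dataset.prepare_for_training(
    framework="spacy", lang=nlp, train_size=0.8, test_size=0.2
)
train_db.to_disk("train.spacy")
dev_db.to_disk("dev.spacy")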
When using the Python client, you now get a human-readable visualization of Dataset and Rule entities.
prepare_for_training methods (#2225) (e53c201), closes #2154 #2132 #2122 #2045 #1697
Changed "to know more" into "to learn more" in the Quickstart login page (#2305) (6082a26)
Replaced rubrix.apikey with argilla.apikey (#2286) (4871127), closes #2254
Published by frascuchon over 1 year ago
ujson for client actions (#2211) (920213e)
Published by frascuchon almost 2 years ago
Since 1.2.0, Argilla supports adding vectors to Argilla records, which can then be used to find the most similar records to a given one. This feature uses vector or semantic search combined with more traditional search (keyword- and filter-based).
You can now find all record details and fields, which can be useful for bookmarking, copy/pasting, and making ES queries.
You can now see the timestamp associated with the record (event timestamp), which corresponds to the moment when the record was uploaded, or to a custom timestamp passed when logging the data (e.g., the moment when the prediction was made, when using Argilla for monitoring).
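As an illustrative sketch of both record-level features (the vector name, vector values, and dataset name are placeholders; a real vector would come from an embedding model):

import datetime
import argilla as rg

record = rg.TextClassificationRecord(
    text="Argilla supports semantic search over records",
    # named vectors enable similarity (semantic) search for this record
    vectors={"sentence-embedding": [0.12, -0.53, 0.33]},
    # custom event timestamp, e.g. the moment the prediction was made
    event_timestamp=datetime.datetime.now(),
)
rg.log(record, name="vectors-example")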
dataset_labels metric processing (#1978) (1c3235e), closes #1818
httpx async client instance (#1958) (a70cb6c), closes #1886
Published by frascuchon almost 2 years ago
You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.
import pandas as pd
from argilla.labeling.text_classification import Rule, add_rules

# Read a file with keywords or phrases
labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

# Create a rule for each row (query and label columns)
predefined_labeling_rules = []
for index, row in labeling_rules_df.iterrows():
    predefined_labeling_rules.append(
        Rule(row["query"], row["label"])
    )

# Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)
You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels
Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes
top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834
users.vue (#1915) by @bengsoon
Published by frascuchon almost 2 years ago
Published by frascuchon almost 2 years ago
Published by frascuchon about 2 years ago
When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens.
Before this release, it was difficult to understand and fix these errors because validation happened on the server side.
With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.
For example, the following record:
import rubrix as rb
rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris"],
    text="I love Paris!",
    prediction=[("LOC", 7, 13)]
)
Will give you the following error message:
ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [Paris!] defined in ...love Paris!
Tokens:
['I', 'love', 'Paris']
Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for cleanup and better dataset maintenance:
import rubrix as rb
## Delete by id
rb.delete_records(name="example-dataset", ids=[1,3,5])
## Discard records by query
rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)
We have two new tutorials!
Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html
Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html
small-text (#1726) (909efdf), closes #1693
Published by frascuchon about 2 years ago
prepare_for_training is a method that prepares a dataset for training. Before this release, prepare_for_training prepared the data for easily training Hugging Face Transformers. Now, you can also prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct!
With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.
import spacy
import rubrix as rb

# Load annotated dataset from Rubrix
rb_dataset = rb.load("ner_dataset")

# Load a blank spaCy language model to create the DocBin, as it works faster
nlp = spacy.blank("en")

# After this line, the file will be stored on disk
rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")
You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin
Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.
Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, a simple batch iteration over the whole dataset can be done using the id_from parameter of the rb.load method.
An example of reading the first 1000 records and the next batch of up to 1000 records:
import rubrix as rb
dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
The reference to the rb.load
method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load
Using filters and search for data annotation and review, some users are able to filter and quickly review dozens of records in one go. To serve those users, it's now possible to see and bulk-annotate 50 or 100 records per page.
Sometimes it is useful to copy the text in records to inspect it or process it with another application. Now, this is possible thanks to the feature request by our great community member and contributor @Ankush-Chander!
Thanks to work done by @Ankush-Chander and @frascuchon we now have more meaningful messages for generic server errors!
rb.load fetch records in batches passing the from_id argument (3e6344a)
prepare_for_training supports spacy (#1635) (8587808)
httpx client (#1640) (854a972), closes #1646
DocBin cookbook (#1642) (bb98278), closes #420
rb.load fetch records in batches passing the from_id argument by @maxserras
httpx client (#1640) by @frascuchon
DocBin cookbook (#1642) by @ignacioct
Published by frascuchon over 2 years ago
Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.
You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners
We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:
import numpy as np
import rubrix as rb
from rubrix.listeners import listener
from sklearn.metrics import accuracy_score

# Define some helper variables
# (trec, DATASET_NAME, NUM_SAMPLES, active_learner and dataset_test are defined earlier in the tutorial)
LABEL2INT = trec["train"].features["label-coarse"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    ctx.query_params["batch_id"] += 1
    new_records = [
        rb.TextClassificationRecord(
            text=trec["train"]["text"][idx],
            metadata={"batch_id": ctx.query_params["batch_id"]},
            id=idx,
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Rubrix
    rb.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )
    ACCURACIES.append(accuracy)
    print("Done!")

    print("Waiting for annotations ...")
https://rubrix.readthedocs.io/
extend_matrix: Weak label augmentation using embeddings
This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
words references in searches (#1571) by @frascuchon
Published by frascuchon over 2 years ago
You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.
import rubrix as rb
# Define labeling schema
settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])
# Apply settings to a new or already existing dataset
rb.configure_dataset(name="my_dataset", settings=settings)
# Logging to the newly created dataset triggers the validation checks
rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
#BadRequestApiError: Rubrix server returned an error with http status: 400
Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html
You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.
Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
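As a rough sketch of what this looks like from Python (the encoder model, dataset name, and threshold value are placeholders; the tutorial above covers the exact API):

from sentence_transformers import SentenceTransformer
from rubrix.labeling.text_classification import WeakLabels

# Build the weak label matrix from the rules defined in the dataset
weak_labels = WeakLabels(dataset="weak_supervision_yt")

# Embed the records with any sentence encoder (model name is just an example)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([rec.text for rec in weak_labels.records()])

# Extend the matrix: unlabeled records whose embedding is close enough
# (above the per-rule threshold) to a rule match inherit that rule's label
weak_labels.extend_matrix([0.8] * len(weak_labels.rules), embeddings)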
Tutorials are now organized into different categories and with a new gallery design!
Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html
This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.
Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html
predicted computation (#1528) (2f2ee2e), closes #1527
@RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413
None and change default value to 0.0 (#1521) (0a02c70), closes #1514
Published by frascuchon over 2 years ago
rb.log
You can now use the parameter background
in the rb.log
method to log records without blocking the main process. The main use case is monitoring production pipelines to do prediction monitoring. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):
from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import JsonInput
from bentoml.frameworks.spacy import SpacyModelArtifact

import rubrix as rb
import spacy

nlp = spacy.load("en_core_web_sm")

@env(infer_pip_packages=True)
@artifacts([SpacyModelArtifact("nlp")])
class SpacyNERService(BentoService):

    @api(input=JsonInput(), batch=True)
    def predict(self, parsed_json_list):
        result, rb_records = ([], [])
        for index, parsed_json in enumerate(parsed_json_list):
            doc = self.artifacts.nlp(parsed_json["text"])
            prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]
            rb_records.append(
                rb.TokenClassificationRecord(
                    text=doc.text,
                    tokens=[t.text for t in doc],
                    prediction=[
                        (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                    ],
                )
            )
            result.append(prediction)

        rb.log(
            name="monitor-for-spacy-ner",
            records=rb_records,
            tags={"framework": "bentoml"},
            background=True,
            verbose=False
        )  # By using background=True, the model latency won't be affected

        return result
To store entity predictions, you can attach a score using the last position of the entity tuple (label, char_start, char_end, score). Let's see an example:
import rubrix as rb
text = "Rubrix is a data science tool"
record = rb.TokenClassificationRecord(
    text=text,
    tokens=text.split(" "),
    prediction=[("PRODUCT", 0, 6, 0.99)]
)
rb.log(record, "ner_with_scores")
Then, in the web application, you and your team can use the score filter to find potentially problematic entities, like in the screenshot below:
If you want to see this in action, check this blog post by David Berenstein:
https://www.rubrix.ml/blog/concise-concepts-rubrix/
We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.
This sidebar should help you quickly understand your progress:
See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html