Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
APACHE-2.0 License
Published by frascuchon over 1 year ago
We have added a Settings page for your datasets. From there, you will be able to manage your dataset. Currently, it is possible to add labels to your labeling schema and delete the dataset.
You can pass a URL in the metadata field _image_url and the image will be rendered in the Argilla UI. You can use this in the Text Classification and the Token Classification tasks. (The example image shown in the release was generated using https://robohash.org.) Apart from _image_url, you can also pass other metadata fields that won't be used in queries or filters by adding an underscore at the start, e.g. _my_field.
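As a minimal sketch of how this could look with the Python client (the dataset name, image URL, and extra metadata key below are placeholders, not part of the release):

import argilla as rg

# Metadata keys starting with an underscore are stored with the record but not used
# in queries or filters; _image_url is additionally rendered as an image in the UI.
record = rg.TextClassificationRecord(
    text="A record with an attached robot avatar",
    metadata={
        "_image_url": "https://robohash.org/argilla",
        "_my_field": "extra info shown with the record",
    },
)
rg.log(record, name="images-example")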
rg.load
You can now specify the fields you want to load from your Argilla dataset. That way, you can avoid loading heavy vectors if you don't need them for your annotations.
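For example, something along these lines (the parameter name and field names are assumptions based on this note; check the rg.load reference for the exact API):

import argilla as rg

# Load only the listed fields and skip heavy ones such as vectors
records = rg.load("my_dataset", fields=["text", "annotation"])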
Check out our new tutorials created by the community!
rg.load takes too long because of the vector field, even when users don't need it. Closes #2398
_image_url (<v1.3.0, <1.3.0)
Published by frascuchon over 1 year ago
A more stylish banner for available global actions. It includes an improved label selector to apply and remove labels in bulk.
We enhanced multi-label text classification annotations: adding labels in bulk no longer removes previous labels. This action will change the status of the records to Pending and you will need to validate the annotation to save the changes.
Learn more about bulk annotations and multi-label text classification annotations in our docs.
New actions to clear all annotations and reset changes. They can be used at the record level or as bulk actions.
Click the Validate or Discard buttons in a record to undo this action.
Improved view for a single record to enable a more focused annotation experience.
Extended support to prepare Text2Text datasets for training with SparkNLP.
Learn more in our docs.
In token classification tasks that have 10+ options, labels get assigned QWERTY keys as shortcuts.
configure_dataset accepts a workspace as argument (#2503) (29c9ee3)
Added active_client function to the main argilla module (#2387) (4e623d4), closes #2183
rg.log or rg.load (#2425) (b3b897a), closes #2059
Deprecated chunk_size in favor of batch_size for rg.log (#2455) (3ebea76), closes #2453
Added batch_size parameter for rg.load (#2460) (e25be3e), closes #2454 #2434
Published by frascuchon over 1 year ago
quickstart: change default api key for the argilla quickstart image (#2357) (bb14f3c)
Resolve errors found in prepare_for_training during autotrain integration (https://github.com/argilla-io/argilla/pull/2411)
Closes https://github.com/argilla-io/argilla/issues/2406
Closes https://github.com/argilla-io/argilla/issues/2407
Closes https://github.com/argilla-io/argilla/issues/2408
Closes https://github.com/argilla-io/argilla/issues/2405
Published by frascuchon over 1 year ago
Most important keywords in the dataset or a subset (using the query param) can be retrieved from Python. This can be useful for EDA and defining programmatic labeling rules:
from argilla.metrics.commons import keywords
summary = keywords(name="example-dataset")
summary.visualize() # will plot a histogram with the results
summary.data # returns the raw result data
Added a new framework sparknlp and extended the support for spacy, including text classification datasets. Check out this section of the docs.
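A rough sketch of what this enables (the dataset name is a placeholder and the "spark-nlp" framework string is an assumption; see the linked docs section for the exact values):

import argilla as rg
import spacy

# A Text Classification dataset previously logged to Argilla (placeholder name)
dataset = rg.load("my_textcat_dataset")

# spaCy: returns a DocBin with the annotations set as doc.cats
nlp = spacy.blank("en")
docbin = dataset.prepare_for_training(framework="spacy", lang=nlp)

# Spark NLP: assumed here to return a DataFrame ready for Spark NLP training
df = dataset.prepare_for_training(framework="spark-nlp")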
You can pass train_size and test_size to prepare_for_training to get train-test splits. This is especially useful for spaCy. Check out this section of the docs.
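A minimal sketch, assuming prepare_for_training returns a (train, test) pair when both sizes are passed (the dataset name is a placeholder):

import argilla as rg
import spacy

nlp = spacy.blank("en")
dataset = rg.load("my_ner_dataset")

# Split 80/20 and export both parts for the spacy train command
train_db, dev_db = dataset.prepare_for_training(
    framework="spacy", lang=nlp, train_size=0.8, test_size=0.2
)
train_db.to_disk("train.spacy")
dev_db.to_disk("dev.spacy")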
When using the Python client, you now get a human-readable visualization of Dataset and Rule entities.
prepare_for_training methods (#2225) (e53c201), closes #2154 #2132 #2122 #2045 #1697
Changed "to know more" into "to learn more" in the Quickstart login page (#2305) (6082a26)
Replaced rubrix.apikey with argilla.apikey (#2286) (4871127), closes #2254
Published by frascuchon over 1 year ago
ujson for client actions (#2211) (920213e)
Published by frascuchon almost 2 years ago
Since 1.2.0, Argilla supports adding vectors to Argilla records, which can then be used to find the most similar records to a given one. This feature uses vector or semantic search combined with more traditional search (keyword- and filter-based).
You can now find all record details and fields, which can be useful for bookmarking, copy/pasting, and making ES queries.
You can now see the timestamp associated with the record (event timestamp), which corresponds to the moment when the record was uploaded, or to a custom timestamp passed when logging the data (e.g., the moment when the prediction was made, when using Argilla for monitoring).
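As an illustrative sketch of both record-level features (the vector name, vector values, and dataset name are placeholders; a real vector would come from an embedding model):

import datetime
import argilla as rg

record = rg.TextClassificationRecord(
    text="Argilla supports semantic search over records",
    # named vectors enable similarity (semantic) search for this record
    vectors={"sentence-embedding": [0.12, -0.53, 0.33]},
    # custom event timestamp, e.g. the moment the prediction was made
    event_timestamp=datetime.datetime.now(),
)
rg.log(record, name="vectors-example")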
dataset_labels metric processing (#1978) (1c3235e), closes #1818
httpx async client instance (#1958) (a70cb6c), closes #1886
Published by frascuchon almost 2 years ago
You can now manage rules programmatically and reflect them in Argilla Datasets so you can iterate on labeling rules from both Python and the UI. This is especially useful for leveraging linguistic resources (such as terminological lists) and making the rules available in the UI for domain experts to refine them.
import pandas as pd
from argilla.labeling.text_classification import Rule, add_rules

# Read a file with keywords or phrases
labeling_rules_df = pd.read_csv("../../_static/datasets/weak_supervision_tutorial/labeling_rules.csv")

# Create a rule for each row (query and label columns)
predefined_labeling_rules = []
for index, row in labeling_rules_df.iterrows():
    predefined_labeling_rules.append(
        Rule(row["query"], row["label"])
    )

# Add the rules to the weak_supervision_yt dataset. The rules will be manageable from the UI
add_rules(dataset="weak_supervision_yt", rules=predefined_labeling_rules)
You can find more info about this feature in the deep dive guide: https://docs.argilla.io/en/latest/guides/techniques/weak_supervision.html#3.-Building-and-analyzing-weak-labels
Users can now sort the records by last_updated and other timestamp fields to improve the labeling and review processes
top_k_mentions metrics instead of entity_consistency (#1880) (42f702d), closes #1834
users.vue (#1915) by @bengsoon
Published by frascuchon almost 2 years ago
Published by frascuchon almost 2 years ago
Published by frascuchon about 2 years ago
When working with Token Classification records, there are very often misalignment problems between the entity spans and provided tokens.
Before this release, it was difficult to understand and fix these errors because validation happened on the server side.
With this release, records are validated during instantiation, giving you a clear error message which can help you to fix/ignore problematic records.
For example, the following record:
import rubrix as rb
rb.TokenClassificationRecord(
    tokens=["I", "love", "Paris"],
    text="I love Paris!",
    prediction=[("LOC", 7, 13)]
)
Will give you the following error message:
ValueError: Following entity spans are not aligned with provided tokenization
Spans:
- [Paris!] defined in ...love Paris!
Tokens:
['I', 'love', 'Paris']
Now it's possible to delete specific records, either by ids or by a query using Lucene's syntax. This is useful for cleanup and better dataset maintenance:
import rubrix as rb
## Delete by id
rb.delete_records(name="example-dataset", ids=[1,3,5])
## Discard records by query
rb.delete_records(name="example-dataset", query="metadata.code=33", discard_only=True)
We have two new tutorials!
Few-shot classification with SetFit and a custom dataset: https://rubrix.readthedocs.io/en/stable/tutorials/few-shot-classification-with-setfit.html
Analyzing predictions with model explainability methods: https://rubrix.readthedocs.io/en/stable/tutorials/nlp_model_explainability.html
small-text (#1726) (909efdf), closes #1693
Published by frascuchon about 2 years ago
prepare_for_training is a method that prepares a dataset for training. Before this release, prepare_for_training prepared the data for easily training Hugging Face Transformers. Now, you can also prepare your training data for spaCy NER pipelines, thanks to our great community contributor @ignacioct!
With the example below, you can export your Rubrix dataset into a DocBin, save it to disk, and then use it with the spacy train command.
import spacy
import rubrix as rb

# Load annotated dataset from Rubrix
rb_dataset = rb.load("ner_dataset")

# Load a blank spaCy language model to create the DocBin, as it works faster
nlp = spacy.blank("en")

# After this line, the file will be stored on disk
rb_dataset.prepare_for_training(framework="spacy", lang=nlp).to_disk("train.spacy")
You can find a full example at: https://rubrix.readthedocs.io/en/v0.17.0/guides/cookbook.html#Train-a-spaCy-model-by-exporting-to-Docbin
Before this release, the rb.load method to read datasets from Python retrieved the full dataset. For large datasets, this could cause high memory consumption, network timeouts, and the inability to read datasets larger than the available memory.
Thanks to the awesome work by @maxserras, it's now possible to optimize memory consumption and avoid network timeouts when working with large datasets. To that end, a simple batch iteration over the whole dataset can be done using the id_from parameter of the rb.load method.
An example of reading the first 1000 records and the next batch of up to 1000 records:
import rubrix as rb
dataset_batch_1 = rb.load(name="example-dataset", limit=1000)
dataset_batch_2 = rb.load(name="example-dataset", limit=1000, id_from=dataset_batch_1[-1].id)
The reference to the rb.load
method can be found at: https://rubrix.readthedocs.io/en/v0.17.0/reference/python/python_client.html#rubrix.load
Using filters and search for data annotation and review, some users are able to filter and quickly review dozens of records in one go. To serve those users, it's now possible to see and bulk-annotate 50 or 100 records per page.
Sometimes it is useful to copy the text in records to inspect it or process it with another application. Now, this is possible thanks to the feature request by our great community member and contributor @Ankush-Chander!
Thanks to work done by @Ankush-Chander and @frascuchon we now have more meaningful messages for generic server errors!
rb.load fetch records in batches passing the from_id argument (3e6344a)
prepare_for_training supports spacy (#1635) (8587808)
httpx client (#1640) (854a972), closes #1646
DocBin cookbook (#1642) (bb98278), closes #420
rb.load fetch records in batches passing the from_id argument by @maxserras
httpx client (#1640) by @frascuchon
DocBin cookbook (#1642) by @ignacioct
Published by frascuchon over 2 years ago
Listeners enable you to define functions that get executed under certain conditions when something changes in a dataset. There are many use cases for this: monitoring annotation jobs, monitoring model predictions, enabling active learning workflows, and many more.
You can find the Python API reference docs here: https://rubrix.readthedocs.io/en/stable/reference/python/python_listeners.html#python-listeners
We will be documenting these use cases with practical examples, but for this release, we've included a new tutorial for using this with active learning: https://rubrix.readthedocs.io/en/stable/tutorials/active_learning_with_small_text.html. This tutorial includes the following listener function, which implements the active learning loop:
import numpy as np
import rubrix as rb
from rubrix.listeners import listener
from sklearn.metrics import accuracy_score

# Define some helper variables
# (trec, DATASET_NAME, NUM_SAMPLES, active_learner and dataset_test are defined earlier in the tutorial)
LABEL2INT = trec["train"].features["label-coarse"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    ctx.query_params["batch_id"] += 1
    new_records = [
        rb.TextClassificationRecord(
            text=trec["train"]["text"][idx],
            metadata={"batch_id": ctx.query_params["batch_id"]},
            id=idx,
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Rubrix
    rb.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )
    ACCURACIES.append(accuracy)
    print("Done!")

    print("Waiting for annotations ...")
https://rubrix.readthedocs.io/
extend_matrix: Weak label augmentation using embeddings
This release includes an exciting feature to augment the coverage of your weak labels using embeddings. You can find a practical tutorial here: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
words references in searches (#1571) by @frascuchon
Published by frascuchon over 2 years ago
You can now predefine and change the label schema of your datasets. This is useful for fixing a set of labels for you and your annotation teams.
import rubrix as rb
# Define labeling schema
settings = rb.TextClassificationSettings(label_schema=["A", "B", "C"])
# Apply settings to a new or already existing dataset
rb.configure_dataset(name="my_dataset", settings=settings)
# Logging to the newly created dataset triggers the validation checks
rb.log(rb.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
#BadRequestApiError: Rubrix server returned an error with http status: 400
Read the docs: https://rubrix.readthedocs.io/en/stable/guides/dataset_settings.html
You can now use an augmentation technique inspired by https://github.com/HazyResearch/epoxy to augment the coverage of your rules using embeddings (e.g., sentence transformers). This is useful for improving the recall of your labeling rules.
Read the tutorial: https://rubrix.readthedocs.io/en/stable/tutorials/extend_weak_labels_with_embeddings.html
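As a rough sketch of what this looks like from Python (the encoder model, dataset name, and threshold value are placeholders; the tutorial above covers the exact API):

from sentence_transformers import SentenceTransformer
from rubrix.labeling.text_classification import WeakLabels

# Build the weak label matrix from the rules defined in the dataset
weak_labels = WeakLabels(dataset="weak_supervision_yt")

# Embed the records with any sentence encoder (model name is just an example)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([rec.text for rec in weak_labels.records()])

# Extend the matrix: unlabeled records whose embedding is close enough
# (above the per-rule threshold) to a rule match inherit that rule's label
weak_labels.extend_matrix([0.8] * len(weak_labels.rules), embeddings)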
Tutorials are now organized into different categories and with a new gallery design!
Read the docs: https://rubrix.readthedocs.io/en/stable/tutorials/introductory.html
This is the first version of the basics guide. This guide will show you how to perform the most basic actions with Rubrix, such as uploading data or data annotation.
Read the docs: https://rubrix.readthedocs.io/en/stable/getting_started/basics.html
predicted computation (#1528) (2f2ee2e), closes #1527
@RafaelBod made his first contribution in https://github.com/recognai/rubrix/pull/1413
None and change default value to 0.0 (#1521) (0a02c70), closes #1514
Published by frascuchon over 2 years ago
rb.log
You can now use the parameter background
in the rb.log
method to log records without blocking the main process. The main use case is monitoring production pipelines to do prediction monitoring. Here's an example with BentoML (you can find the full example in the updated Monitoring guide):
from bentoml import BentoService, api, artifacts, env
from bentoml.adapters import JsonInput
from bentoml.frameworks.spacy import SpacyModelArtifact

import rubrix as rb
import spacy

nlp = spacy.load("en_core_web_sm")

@env(infer_pip_packages=True)
@artifacts([SpacyModelArtifact("nlp")])
class SpacyNERService(BentoService):

    @api(input=JsonInput(), batch=True)
    def predict(self, parsed_json_list):
        result, rb_records = ([], [])
        for index, parsed_json in enumerate(parsed_json_list):
            doc = self.artifacts.nlp(parsed_json["text"])
            prediction = [{"entity": ent.text, "label": ent.label_} for ent in doc.ents]
            rb_records.append(
                rb.TokenClassificationRecord(
                    text=doc.text,
                    tokens=[t.text for t in doc],
                    prediction=[
                        (ent.label_, ent.start_char, ent.end_char) for ent in doc.ents
                    ],
                )
            )
            result.append(prediction)

        rb.log(
            name="monitor-for-spacy-ner",
            records=rb_records,
            tags={"framework": "bentoml"},
            background=True,
            verbose=False
        )  # By using background=True, the model latency won't be affected

        return result
To store entity predictions, you can attach a score using the last position of the entity tuple (label, char_start, char_end, score). Let's see an example:
import rubrix as rb
text = "Rubrix is a data science tool"
record = rb.TokenClassificationRecord(
    text=text,
    tokens=text.split(" "),
    prediction=[("PRODUCT", 0, 6, 0.99)]
)
rb.log(record, "ner_with_scores")
Then, in the web application, you and your team can use the score filter to find potentially problematic entities, like in the screenshot below:
If you want to see this in action, check this blog post by David Berenstein:
https://www.rubrix.ml/blog/concise-concepts-rubrix/
We have a fresh new sidebar for the weak labeling mode, where you can see your overall rule metrics as you define new rules.
This sidebar should help you quickly understand your progress:
See the updated user guide here: https://rubrix.readthedocs.io/en/v0.14.0/reference/webapp/define_rules.html