Argilla is a collaboration tool for AI engineers and domain experts to build high-quality datasets
APACHE-2.0 License
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.29.0...v1.29.1
Published by frascuchon 4 months ago
- `argilla`: simplify structure for flatten records to list by @frascuchon in https://github.com/argilla-io/argilla/pull/5137
- `argilla`: define `argilla-v1` as optional dependency by @frascuchon in https://github.com/argilla-io/argilla/pull/5120
- `argilla`: normalize records when exporting flatten by @frascuchon in https://github.com/argilla-io/argilla/pull/5138
- `argilla`: support read draft response models without values by @frascuchon in https://github.com/argilla-io/argilla/pull/5124
- `argilla`: lazy resolution for dataset workspaces by @frascuchon in https://github.com/argilla-io/argilla/pull/5152
- `argilla`: rename `status` to `response.status` for filtering using the SDK by @frascuchon in https://github.com/argilla-io/argilla/pull/5145
- `argilla-server`: `await` on similarity search when filtering response values without user by @frascuchon in https://github.com/argilla-io/argilla/pull/5159
- `sdk-v1` to `legacy` by @frascuchon in https://github.com/argilla-io/argilla/pull/5168
Full Changelog: https://github.com/argilla-io/argilla/compare/v2.0.0rc1...v2.0.0rc2
Published by frascuchon 4 months ago
`Dataset` to rule them all

The main difference between Argilla 1.x and Argilla 2.x is that we've converted the previous dataset types tailored for specific NLP tasks into a single highly-configurable `Dataset` class.

With the new `Dataset` you can combine multiple fields and question types, so you can adapt the UI to your specific project. This offers you more flexibility, while making Argilla easier to learn and maintain.
[!IMPORTANT]
If you want to continue using legacy datasets in Argilla 2.x, you will need to convert them into v2 `Dataset`s as explained in this migration guide. This includes: `DatasetForTextClassification`, `DatasetForTokenClassification`, and `DatasetForText2Text`.
`FeedbackDataset`s do not need to be converted, as they are already compatible with the Argilla v2 format.
We've redesigned our SDK to adapt it to the new single `Dataset` class and, most importantly, to improve the user and developer experience.
The main goal of the new design is to make the SDK easier to use and learn, making it much simpler and faster to configure your dataset and get it up and running.
To learn more about this new SDK, you can check:
We have also revamped our UI for Argilla 2.0:
`SpanQuestion`s are now supported in the bulk view.

https://github.com/argilla-io/argilla/assets/126158523/f77e60de-5824-44ad-8b68-a087b223aa9d
This new version of Argilla comes hand-in-hand with a revamped documentation: https://argilla-io.github.io/argilla/latest
We have applied the Diátaxis framework and UX principles in the hope of making this version cleaner and the information easier to find. Let us know what you think!
[!NOTE]
This is a release candidate ahead of the official Argilla 2.0 release. Try it out and let us know what you think.
Find us in Discord or open a GitHub issue here.
- `argilla-sdk` project by @frascuchon in https://github.com/argilla-io/argilla/pull/4891
- `argilla-sdk` by @frascuchon in https://github.com/argilla-io/argilla/pull/4937
- `argilla-server`: Query on response values without a user by @frascuchon in https://github.com/argilla-io/argilla/pull/5003
- `external_id` or `id` on bulk operations by @frascuchon in https://github.com/argilla-io/argilla/pull/5014
- `argilla-v1` - 1.29.0 by @frascuchon in https://github.com/argilla-io/argilla/pull/5032
- `argilla` release job by @frascuchon in https://github.com/argilla-io/argilla/pull/5037
- `argilla`: support python 3.12 by @frascuchon in https://github.com/argilla-io/argilla/pull/5040
- `argilla`: prevent errors checking `Dataset` instances when `datasets` is not installed by @frascuchon in https://github.com/argilla-io/argilla/pull/5045
- `argilla-server` package release by @frascuchon in https://github.com/argilla-io/argilla/pull/5039
- `argilla` and `argilla-v1` projects by @frascuchon in https://github.com/argilla-io/argilla/pull/5065
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.29.0...v2.0.0rc1
Published by frascuchon 5 months ago
[!WARNING]
This will be the last release of Argilla v1. Starting from Argilla 2.0.0, we will only support `FeedbackDataset`s, which will be renamed to `Dataset`. All other dataset types (`DatasetForTextClassification`, `DatasetForTokenClassification`, and `DatasetForText2Text`) will be deprecated. In the next release, we will provide more information and documentation on how to migrate all your datasets into Argilla 2.0 `Dataset`s.
Your search matches are now highlighted so you can easily see the results of your search. We’ve also added a selector for datasets with more than one record field, so you can choose whether to search in All fields or a specific one.
https://github.com/argilla-io/argilla/assets/126158523/b9af3313-a5c3-46b6-83b7-6624662dba04
You can now check all the information and metadata associated with each record directly in the UI.
https://github.com/argilla-io/argilla/assets/126158523/4a3cc4e0-8be7-4927-8d80-8cf84a0dce8b
v1.29.0
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.28.0...v1.29.0
Published by jfcalvo 6 months ago
https://github.com/argilla-io/argilla/assets/126158523/380004e0-28cb-409f-b11c-71d0e3b6e8bf
`MultiLabelQuestion` and `RankingQuestion`

`MultiLabelQuestion` and `RankingQuestion` now take one score per suggested label / value, making the scores easier to interpret. Learn more about suggestions and their scores here.
[!WARNING]
If you upgrade to this version, all previous scores in suggestions for `MultiLabelQuestion`, `RankingQuestion` and `SpanQuestion` will turn to NULL, as they will not be valid in the new schema. Please make sure you upload the scores again if you want to use them.
Scores are now shown next to their label / value in all questions. This makes them more visible and easier to interpret.
Now you can order labels in `MultiLabelQuestion` so that suggestions are always shown first. This will help you make sure that the most relevant labels are always at hand. Plus, if you’ve added scores to your labels, these will be ordered in descending order. To enable this, go to the Dataset Settings page > Questions and enable “Suggestions first” for the desired question.
`SpanQuestion` improvements

https://github.com/argilla-io/argilla/assets/126158523/fad7b9ca-3890-45ed-acc8-5b038a81db06
We’ve improved the way selections are shown. You can now see a highlight that represents what the final selection will look like while you’re dragging your mouse. This helps you select faster and shows the difference between token- and character-level selection.
[!NOTE]
Remember that character-level spans are activated by holding `Shift` while making the selection.
We’ve improved the way the label selector works in the `SpanQuestion` when overlapping spans are enabled, so it’s easier to add or correct labels. Simply click on the desired span to activate the selector, then click on the label(s) that you want to add or remove.
We’ve added a warning for Argilla instances deployed on Hugging Face Spaces to alert of data loss when the persistent storage is not enabled.
To learn more about this warning and how to disable it, go to our docs.
- `labels_order` attribute. (#4757)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.27.0...v1.28.0
Published by damianpumar 6 months ago
We are finally releasing a long-awaited feature: overlapping spans. This allows you to draw more than one span over the same token(s)/character(s).
https://github.com/argilla-io/argilla/assets/126158523/3aeb6c6c-b348-4b3d-be67-483636c76293
To try them out, set up a `SpanQuestion` with the argument `allow_overlap=True` like this:
dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.SpanQuestion(
            name="spans",
            labels=["label1", "label2", "label3"],
            field="text",
            allow_overlap=True
        )
    ]
)
Learn more about configuring this and other question types here.
We’ve included a new column on our home page that shows the global progress of your datasets, so you can see at a glance which datasets are closest to completion.
These bars show progress by grouping records based on the status of their responses:

- records with a `submitted` status.
- records with a `discarded` status.
- records with both a `submitted` and a `discarded` response.
- records without `submitted` or `discarded` responses. These may be in `pending` or `draft`.

We’ve improved the way suggestions are shown in the UI to make their purpose clearer: now you can identify each suggestion by a sparkle icon ✨.
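The progress grouping described above can be sketched in a few lines of Python. This is an illustrative sketch only: the bucket names (`conflicting`, `left`) are assumptions, not Argilla's internal ones.

```python
from collections import Counter

def dataset_progress(records):
    """Bucket records by the statuses of their responses, mirroring the
    progress bars in the UI. `records` is a list of response-status lists,
    one per record (illustrative sketch, not Argilla's implementation)."""
    counts = Counter()
    for statuses in records:
        if "submitted" in statuses and "discarded" in statuses:
            counts["conflicting"] += 1  # both a submitted and a discarded response
        elif "submitted" in statuses:
            counts["submitted"] += 1
        elif "discarded" in statuses:
            counts["discarded"] += 1
        else:
            counts["left"] += 1  # only pending/draft responses, or none at all
    return dict(counts)
```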
The behavior is still the same:
We’ve increased the limit of labels you can use in Label, Multilabel and Span questions to 500. If you need to go beyond that number, you can set up a custom limit using the following environment variables:
- `ARGILLA_LABEL_SELECTION_OPTIONS_MAX_ITEMS` to set the limit in label and multi-label questions.
- `ARGILLA_SPAN_OPTIONS_MAX_ITEMS` to set the limit in span questions.

[!WARNING]
The UI has been optimized to support up to 1000 labels. If you go beyond this limit, the UI may not be as responsive.

Learn more about this and other environment variables here.
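For example, you could raise both limits before starting the server; the value 800 here is just a placeholder:

```shell
# Hypothetical values; the UI is optimized for up to ~1000 labels
export ARGILLA_LABEL_SELECTION_OPTIONS_MAX_ITEMS=800   # label / multi-label questions
export ARGILLA_SPAN_OPTIONS_MAX_ITEMS=800              # span questions
```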
Thanks to our contributor @paulbauriegel you can now use Argilla fully in German! If that is the main language of your browser, there is nothing you need to do, the UI will automatically detect that and switch to German.
Would you like to translate Argilla to your own language? Reach out to us and we'll help you!
- `FeedbackDataset` (#4668)
- `allow_overlapping` parameter for span questions. (#4697)
- `Datasets` table (#4696)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.26.1...v1.27.0
Published by jfcalvo 7 months ago
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.26.0...v1.26.1
Published by jfcalvo 7 months ago
We've added a new type of question to Feedback Datasets: the `SpanQuestion`. This type of question allows you to highlight portions of text in a specific field and apply a label. It is especially useful for token classification (like NER or POS tagging) and information extraction tasks.
https://github.com/argilla-io/argilla/assets/126158523/d3821d49-6da0-4488-99e2-068d7411268a
With this type of question you can:
✨ Provide suggested spans with a confidence score, so your team doesn't need to start from scratch.
⌨️ Choose a label using your mouse or with the keyboard shortcut provided next to the label.
🖱️ Draw a span by dragging your mouse over the parts of the text you want to select or if it's a single token, just double-click on it.
🪄 Forget about mistakes with token boundaries. The UI will snap your spans to token boundaries for you.
🔎 Annotate at character-level when you need more fine-grained spans. Hold the `Shift` key while drawing the span, and the resulting span will start and end at the exact boundaries of your selection.
✔️ Quickly change the label of a span by clicking on the label name and selecting the correct one from the dropdown.
🖍️ Correct a span at the speed of light by simply drawing the correct span over it. The new span will overwrite the old one.
🧼 Remove labels by hovering over the label name in the span and then click on the 𐢫 on the left hand side.
Here's an example of what your dataset would look like from the SDK:
import argilla as rg
from argilla.client.feedback.schemas import SpanValueSchema

# connect to your Argilla instance
rg.init(...)

# create a dataset with a span question
dataset = rg.FeedbackDataset(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.SpanQuestion(
            name="entities",
            title="Highlight the entities in the text:",
            labels={"PER": "Person", "ORG": "Organization", "EVE": "Event"},  # or ["PER", "ORG", "EVE"]
            field="text",  # the field where you want to do the span annotation
            required=True
        )
    ]
)

# create a record with suggested spans
record = rg.FeedbackRecord(
    fields={"text": "This is the text of the record"},
    suggestions=[
        {
            "question_name": "entities",
            "value": [
                SpanValueSchema(
                    start=0,  # position of the first character of the span
                    end=10,  # position of the character right after the end of the span
                    label="ORG",
                    score=1.0
                )
            ],
            "agent": "my_model",
        }
    ]
)

# add records to the dataset and push to Argilla
dataset.add_records([record])
dataset.push_to_argilla(...)
To learn more about this and all the other questions available in Feedback Datasets, check out our documentation on:
- single or multi label Question, the state is maintained during the entire annotation process. (#4630)
- `span` questions for `FeedbackDataset`. (#4622)
- `ARGILLA_CACHE_DIR` environment variable to configure the client cache directory. (#4509)
- `RankingValueSchema` instances to suggestions. (#4628)
- `ds.pull` or iterating over the dataset. (#4662)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.25.0...v1.26.0
Published by frascuchon 8 months ago
`admin` and `owner` users can now change the order in which labels appear in the question form. To do this, go to the `Questions` tab inside Dataset Settings and move the labels until they are in the desired order.
https://github.com/argilla-io/argilla/assets/126158523/40f382a5-35c6-4bea-b15c-f001f539940d
The `missing` status has been removed from the SDK filters. To filter records that don't have responses, you will now need to use the `pending` status like so:
filtered_dataset = dataset.filter_by(response_status="pending")
Learn more about how to use this filter in our docs
We’ve removed the `pandas <2.0.0` restriction, so you can now safely use Argilla with pandas v1 or v2.
[!NOTE]
For changes in the argilla-server module, visit the argilla-server release notes
- dataset settings page for single/multi label questions (#4598)
- `missing` response for status filter. Use `pending` instead. (#4533)
- `user-settings` instead of 404 `user_settings` (#4609)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.24.0...v1.25.0
Published by frascuchon 9 months ago
[!NOTE]
This release does not contain any new features, but it includes a major change in the Argilla server. The package is using the `argilla-server` dependency defined here.
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.23.1...v1.24.0
Published by frascuchon 9 months ago
Full Changelog: https://github.com/argilla-io/argilla/compare/v1.23.0...v1.23.1
Published by jfcalvo 9 months ago
You can now set up OAuth in your Argilla Hugging Face Spaces. This is a simple way to have your team members, or collaborators in crowdsourced projects, sign in and log in to your Space using their Hugging Face accounts.
To learn how to set up Hugging Face OAuth for your Argilla Space, go to our docs.
We’ve added an improvement for our bulk view so you can perform actions on all results from a filter (or a combination of them!).
To use this, go to the bulk view and apply some filter(s) of your choice. If there are more results than the records shown on the current page, clicking the checkbox will give you the option to select all of the results. Then, you can give responses, discard, save drafts and even submit all of the records at once!
We’ve added the `pdf_to_html` function to our utilities so you can easily embed a PDF reader within a TextField using markdown.
This function accepts a file path, a URL, or the file's byte data, and returns the corresponding HTML to render the PDF within the Argilla user interface.
Learn more about how to use this feature here.
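Under the hood, embedding a PDF in a markdown-enabled field boils down to wrapping its bytes in a base64 data URL inside an HTML `<object>` tag. Here is a stdlib-only sketch of that idea (use Argilla's own `pdf_to_html` in practice; the helper name and default sizes below are made up):

```python
import base64

def pdf_bytes_to_object_tag(pdf_bytes: bytes, width: str = "700px", height: str = "700px") -> str:
    """Return an HTML <object> tag embedding the PDF as a base64 data URL."""
    encoded = base64.b64encode(pdf_bytes).decode("utf-8")
    return (
        f'<object data="data:application/pdf;base64,{encoded}" '
        f'type="application/pdf" width="{width}" height="{height}"></object>'
    )
```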
- `Record` schema now always includes `dataset_id` as attribute. (#4482)
- `Response` schema now always includes `record_id` as attribute. (#4482)
- `Question` schema now always includes `dataset_id` attribute. (#4487)
- `Field` schema now always includes `dataset_id` attribute. (#4488)
- `MetadataProperty` schema now always includes `dataset_id` attribute. (#4489)
- `VectorSettings` schema now always includes `dataset_id` attribute. (#4490)
- `pdf_to_html` function to `.html_utils` module that converts PDFs to data URLs to be able to render them in the Argilla UI. (#4481)
- `ARGILLA_AUTH_SECRET_KEY` environment variable. (#4539)
- `ARGILLA_AUTH_ALGORITHM` environment variable. (#4539)
- `ARGILLA_AUTH_TOKEN_EXPIRATION` environment variable. (#4539)
- `ARGILLA_AUTH_OAUTH_CFG` environment variable. (#4546)
- `ARGILLA_LOCAL_AUTH_*` environment variables. Will be removed in release v1.25.0. (#4539)
- `username` attribute in `UserCreate`. Now uppercase letters are allowed. (#4544)
- `Authorization` header from python SDK requests. (#4535)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.22.0...v1.23.0
Published by frascuchon 9 months ago
Our signature bulk actions are now available for Feedback datasets!
Switch between Focus and Bulk depending on your needs:
For now, this is only available in the Pending queue, but rest assured, bulk actions will be improved and extended to other queues in upcoming releases.
Read more about our Focus and Bulk views here.
We now support sorting records in the Argilla UI based on the values of Rating questions (both suggestions and responses):
Learn about this and other filters in our docs.
It’s now easier than ever to add vector embeddings to your records with the new Sentence Transformers integration.
Just choose a model from the Hugging Face hub and use our SentenceTransformersExtractor
to add vectors to your dataset:
import argilla as rg
from argilla.client.feedback.integrations.sentencetransformers import SentenceTransformersExtractor
# Connect to Argilla
rg.init(
api_url="http://localhost:6900",
api_key="owner.apikey",
workspace="my_workspace"
)
# Initialize the SentenceTransformersExtractor
ste = SentenceTransformersExtractor(
model = "TaylorAI/bge-micro-v2", # Use a model from https://huggingface.co/models?library=sentence-transformers
show_progress = False,
)
# Load a dataset from your Argilla instance
ds_remote = rg.FeedbackDataset.from_argilla("my_dataset")
# Update the dataset
ste.update_dataset(
dataset=ds_remote,
fields=["context"], # Only update the context field
update_records=True, # Update the records in the dataset
overwrite=False, # Don't overwrite fields that already have vectors
)
Learn more about this functionality in this tutorial.
- `vector_settings` to the `__repr__` method of the `FeedbackDataset` and `RemoteFeedbackDataset`. (#4454)
- `sentence-transformers` using `SentenceTransformersExtractor` to configure `vector_settings` in `FeedbackDataset` and `FeedbackRecord`. (#4454)
- `argilla.cli.server` definitions have been moved to the `argilla.server.cli` module. (#4472)
- `vector_settings_by_name` for generic `property_by_name` usage, which will return `None` instead of raising an error. (#4454)
- `ES_INDEX_REGEX_PATTERN` in module `argilla._constants` is now private. (#4472)
- `nan` values in metadata properties will raise a 422 error when creating/updating records. (#4300)
- `None` values are now allowed in metadata properties. (#4300)
- `missing` response status for filtering records is deprecated and will be removed in release v1.24.0. Use `pending` instead. (#4433)
- `python -m argilla database` command has been removed. (#4472)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.21.0...v1.22.0
Published by damianpumar 10 months ago
- `FeedbackDataset` (`argilla.client.feedback.metrics`). (#4175)
- `401` HTTP status code (#4362)
- `textdescriptives` using `TextDescriptivesExtractor` to configure `metadata_properties` in `FeedbackDataset` and `FeedbackRecord`. (#4400). Contributed by @m-newhauser
- `POST /api/v1/me/responses/bulk` endpoint to create responses in bulk for the current user. (#4380)
- `httpx_extra_kwargs` argument to `rg.init` and `Argilla` to allow passing extra arguments to the `httpx.Client` used by `Argilla`. (#4440)
- `ArgillaSingleton`, `init` and `active_client` to a new module `singleton`. (#4347)
- `argilla.load` functions to also work with `FeedbackDataset`s. (#4347)
- `argilla.delete` functions to also work with `FeedbackDataset`s. It now raises an error if the dataset does not exist. (#4347)
- `argilla.list_datasets` functions to also work with `FeedbackDataset`s. (#4347)
- `TextClassificationSettings.from_dict` method in which the `label_schema` created was a list of `dict` instead of a list of `str`. (#4347)
- `draft` auto save for annotation view (#4334)

Published by davidberenstein1957 11 months ago
We’ve added new filters in the Argilla UI to filter records within Feedback datasets based on response values and suggestion information. It is also possible to sort records based on suggestion scores. This is available for questions of the type `LabelQuestion`, `MultiLabelQuestion` and `RatingQuestion`.
We added several methods to assign records to annotators via controlled overlap: `assign_records` and `assign_workspaces`.
from argilla.client.feedback.utils import assign_records
assignments = assign_records(
users=users,
records=records,
overlap=1,
shuffle=True
)
from argilla.client.feedback.utils import assign_workspaces
assignments = assign_workspaces(
assignments=assignments,
workspace_type="individual"
)
for username, records in assignments.items():
dataset = rg.FeedbackDataset(
fields=fields, questions=questions, metadata=metadata,
vector_settings=vector_settings, guidelines=guidelines
)
dataset.add_records(records)
remote_dataset = dataset.push_to_argilla(name="my_dataset", workspace=username)
Argilla supports basic handling of video, audio, and images within markdown fields, provided they are formatted in HTML. To facilitate this, we offer three functions: `video_to_html`, `audio_to_html`, and `image_to_html`. Note that performance differs per browser and database configuration.
from argilla.client.feedback.utils import audio_to_html, image_to_html, video_to_html
# Configure the FeedbackDataset
ds_multi_modal = rg.FeedbackDataset(
fields=[rg.TextField(name="content", use_markdown=True, required=True)],
questions=[rg.TextQuestion(name="description", title="Describe the content of the media:", use_markdown=True, required=True)],
)
# Add the records
records = [
rg.FeedbackRecord(fields={"content": video_to_html("/content/snapshot.mp4")}),
rg.FeedbackRecord(fields={"content": audio_to_html("/content/sea.wav")}),
rg.FeedbackRecord(fields={"content": image_to_html("/content/peacock.jpg")}),
]
ds_multi_modal.add_records(records)
# Push the dataset to Argilla
ds_multi_modal = ds_multi_modal.push_to_argilla("multi-modal-basic", workspace="admin")
You can also add custom highlights to the text by using `create_token_highlights` and a custom color map.
from argilla.client.feedback.utils import create_token_highlights
tokens = ["This", "is", "a", "test"]
weights = [0.1, 0.2, 0.3, 0.4]
html = create_token_highlights(tokens, weights, c_map=custom_RGB) # 'viridis' by default
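For intuition, the weighted-highlight idea can be sketched without a colormap: wrap each token in a `<span>` whose background opacity follows its weight. This is a stdlib stand-in for `create_token_highlights`, not its actual implementation:

```python
def highlight_tokens(tokens, weights, rgb=(255, 165, 0)):
    """Return HTML where each token's background opacity tracks its weight."""
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    r, g, b = rgb
    return " ".join(
        f'<span style="background-color: rgba({r}, {g}, {b}, {w:.2f})">{t}</span>'
        for t, w in zip(tokens, weights)
    )
```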
- `GET /api/v1/datasets/:dataset_id/records/search/suggestions/options` endpoint to return suggestion available options for searching. (#4260)
- `metadata_properties` to the `__repr__` method of the `FeedbackDataset` and `RemoteFeedbackDataset`. (#4192)
- `get_model_kwargs`, `get_trainer_kwargs`, `get_trainer_model`, `get_trainer_tokenizer` and `get_trainer` methods to the `ArgillaTrainer` to improve interoperability across frameworks. (#4214)
- `ArgillaTrainer` to allow for better interoperability of `defaults` and `formatting_func` usage. (#4214)
- `update_config` method of `ArgillaTrainer` to emphasize if the `kwargs` were updated correctly. (#4214)
- `argilla.client.feedback.utils` module with `html_utils` (this mainly includes `video/audio/image_to_html`, which convert media to data URLs to be able to render them in the Argilla UI, and `create_token_highlights` to highlight tokens in a custom way; both work on TextQuestion and TextField with `use_markdown=True`) and `assignments` (this mainly includes `assign_records` to assign records according to a number of annotators and records, an overlap and the shuffle option; and `assign_workspace` to assign and create if needed a workspace according to the record assignment). (#4121)
- `ArgillaTrainer`, with numerical labels, using `RatingQuestion` instead of `RankingQuestion` (#4171)
- `ArgillaTrainer`, now we can train for `extractive_question_answering` using a validation sample (#4204)
- `ArgillaTrainer`, when training for `sentence-similarity` it didn't work with a list of values per record (#4211)
- `RankingQuestion` (#4295)
- `TextClassificationSettings.labels_schema` order was not being preserved. Closes #3828 (#4332)
- `draft` responses to create records endpoint. (#4354)
- `agent` field now only accepts some specific characters and a limited length. (#4265)
- `score` field now only accepts float values in the range `0` to `1`. (#4266)
- `POST /api/v1/dataset/:dataset_id/records/search` endpoint to support optional `query` attribute. (#4327)
- `POST /api/v1/dataset/:dataset_id/records/search` endpoint to support `filter` and `sort` attributes. (#4327)
- `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint to support optional `query` attribute. (#4270)
- `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint to support `filter` and `sort` attributes. (#4270)
- `FeedbackDataset` to Argilla from `tqdm` style to `rich`. (#4267). Contributed by @zucchini-nlp.
- `push_to_argilla` to print the `repr` of the pushed `RemoteFeedbackDataset` after push and changed `show_progress` to True by default. (#4223)
- `models` and `tokenizer` for the `ArgillaTrainer` to explicitly allow for changing them when needed. (#4214)

Published by davidberenstein1957 11 months ago
We have chosen to stop raising a `ValueError` in the `FeedbackDataset.*_by_name()` methods: `FeedbackDataset.question_by_name()`, `FeedbackDataset.field_by_name()` and `FeedbackDataset.metadata_property_by_name()`. Instead, these methods will now return `None` when no match is found. This change is backwards compatible with previous versions of Argilla, but might break your code if you rely on the `ValueError` being raised.
If you have included vectors and vector settings in your dataset, you can use the similarity search features within that dataset.
In the Argilla UI, you can find records that are similar to each other using the Find similar button at the top right corner of the record card. Here's how to do it:
In the SDK, you can do the same like this:
ds = rg.FeedbackDataset.from_argilla("my_dataset", workspace="my_workspace")
# using another record
similar_records = ds.find_similar_records(
vector_name="my_vector",
record=ds.records[0],
max_results=5
)
# work with the resulting tuples
for record, score in similar_records:
...
You can also find records that are similar to a given text, but bear in mind that the dimensions of the resulting vector should match those of the vectors used in the dataset records:
similar_records = ds.find_similar_records(
vector_name="my_vector",
value=embedder_model.embeddings("My text is here")
# value=embedder_model.embeddings("My text is here").tolist() # for numpy arrays
)
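Since a dimension mismatch only surfaces when the search is executed, it can help to validate the query vector up front. A small hypothetical helper (not part of the Argilla SDK):

```python
def validate_query_vector(value, expected_dimensions: int) -> list:
    """Check that a query vector matches the dataset's vector settings dimensions."""
    vector = [float(v) for v in value]  # accepts lists, tuples, or numpy arrays
    if len(vector) != expected_dimensions:
        raise ValueError(
            f"query vector has {len(vector)} dimensions, "
            f"but the vector settings expect {expected_dimensions}"
        )
    return vector
```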
`FeedbackDataset`

You can now add vectors to your Feedback dataset and records to enable similarity search.
To do that, first, you need to add vector settings to your dataset:
dataset = rg.FeedbackDataset(
    fields=[...],
    questions=[...],
    vector_settings=[
        rg.VectorSettings(
            name="my_vectors",
            dimensions=768,
            title="My Vectors"  # optional
        )
    ]
)
Then, you can add vectors to your records, where the key matches the `name` of your vector settings and the value is a `List[float]`:
record = rg.FeedbackRecord(
fields={...},
vectors={"my_vectors": [...]}
)
⚠️ For vector search in OpenSearch, the filtering is applied using a `post_filter` step, since there is a bug that makes queries combining filtering and KNN fail from Argilla.
See https://github.com/opensearch-project/k-NN/issues/1286
[TODO: Add a link to the docs]
`FeedbackDataset`

We added a `show_progress` argument to the `from_huggingface()` method to make the progress bar for the record-parsing process optional.
`RemoteFeedbackDataset`

We have added additional support for the `pull()` method of `RemoteFeedbackDataset`. It is now possible to pull a `RemoteFeedbackDataset` with a specific `max_records` argument. In combination with the previously introduced `filter_by` and `sort_by`, this allows more fine-grained control over the records that are pulled from Argilla.
`ArgillaTrainer`

The `ArgillaTrainer` class has been updated to support additional features. Hugging Face models can now be shared to the Hugging Face Hub directly from the `ArgillaTrainer.push_to_huggingface` method. Additionally, we have included `filter_by`, `sort_by`, and `max_records` arguments in the `ArgillaTrainer` initialisation to allow more fine-grained control over the records used for training.
from argilla import SortBy
trainer = ArgillaTrainer(
dataset=dataset,
task=task,
framework="setfit",
filter_by={"response_status": ["submitted"]},
sort_by=[SortBy(field="metadata.my-metadata", order="asc")],
max_records=1000
)
- `inserted_at` and `updated_at` datetime fields.
- `POST /api/v1/datasets/:dataset_id/records/search` endpoint to search for records without user context, including responses by all users. (#4143)
- `POST /api/v1/datasets/:dataset_id/vectors-settings` endpoint for creating vector settings for a dataset. (#3776)
- `GET /api/v1/datasets/:dataset_id/vectors-settings` endpoint for listing the vector settings for a dataset. (#3776)
- `DELETE /api/v1/vectors-settings/:vector_settings_id` endpoint for deleting vector settings. (#3776)
- `PATCH /api/v1/vectors-settings/:vector_settings_id` endpoint for updating vector settings. (#4092)
- `GET /api/v1/records/:record_id` endpoint to get a specific record. (#4039)
- `GET /api/v1/datasets/:dataset_id/records` endpoint response using the `include` query param. (#4063)
- `GET /api/v1/me/datasets/:dataset_id/records` endpoint response using the `include` query param. (#4063)
- `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint response using the `include` query param. (#4063)
- `show_progress` argument to the `from_huggingface()` method to make the progress bar for the record-parsing process optional. (#4132)
- `from_huggingface()` method with `trange` in `tqdm`. (#4132)
- `inserted_at` or `updated_at` for datasets with no metadata. (#4147)
- `max_records` argument to the `pull()` method for `RemoteFeedbackDataset`. (#4074)
- `ArgillaTrainer.push_to_huggingface` (#3976). Contributed by @Racso-3141.
- `filter_by` argument to `ArgillaTrainer` to filter by `response_status` (#4120).
- `sort_by` argument to `ArgillaTrainer` to sort by `metadata` (#4120).
- `max_records` argument to `ArgillaTrainer` to limit the records used for training (#4120).
- `add_vector_settings` method to local and remote `FeedbackDataset`. (#4055)
- `update_vectors_settings` method to local and remote `FeedbackDataset`. (#4122)
- `delete_vectors_settings` method to local and remote `FeedbackDataset`. (#4130)
- `vector_settings_by_name` method to local and remote `FeedbackDataset`. (#4055)
- `find_similar_records` method to local and remote `FeedbackDataset`. (#4023)
- `ARGILLA_SEARCH_ENGINE` environment variable to configure the search engine to use. (#4019)
- `ARGILLA_SEARCH_ENGINE=opensearch`. (#4019 and #4111)
- `FeedbackDataset.*_by_name()` methods to return `None` when no match is found (#4101).
- `limit` query parameter for the `GET /api/v1/datasets/:dataset_id/records` endpoint now only accepts values greater than or equal to 1 and less than or equal to 1000. (#4143)
- `limit` query parameter for the `GET /api/v1/me/datasets/:dataset_id/records` endpoint now only accepts values greater than or equal to 1 and less than or equal to 1000. (#4143)
- `GET /api/v1/datasets/:dataset_id/records` endpoint to fetch records using the search engine. (#4142)
- `GET /api/v1/me/datasets/:dataset_id/records` endpoint to fetch records using the search engine. (#4142)
- `POST /api/v1/datasets/:dataset_id/records` endpoint to allow creating records with `vectors`. (#4022)
- `PATCH /api/v1/datasets/:dataset_id` endpoint to allow updating the `allow_extra_metadata` attribute. (#4112)
- `PATCH /api/v1/datasets/:dataset_id/records` endpoint to allow updating records with `vectors`. (#4062)
- `PATCH /api/v1/records/:record_id` endpoint to allow updating a record with `vectors`. (#4062)
- `POST /api/v1/me/datasets/:dataset_id/records/search` endpoint to allow searching records with vectors. (#4019)
- `BaseElasticAndOpenSearchEngine.index_records` method to also index record vectors. (#4062)
- `FeedbackDataset.__init__` to allow passing a list of vector settings. (#4055)
- `FeedbackDataset.push_to_argilla` to also push vector settings. (#4055)
- `FeedbackDatasetRecord` to support the creation of records with vectors. (#4043)
- `from_huggingface()` method with `trange` in `tqdm`. (#4132)

Published by frascuchon 12 months ago
You can now filter and sort records in Feedback Datasets in the UI and Python SDK using the metadata included in the records. To do that, you will first need to set up a `MetadataProperty` in your dataset:
# set up a dataset including metadata properties
dataset = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="response"),
    ],
    questions=[
        rg.TextQuestion(name="question")
    ],
    metadata_properties=[
        rg.TermsMetadataProperty(name="source"),
        rg.IntegerMetadataProperty(name="response_length", title="Response length")
    ]
)
Learn more about how to define metadata properties and how to add or delete metadata properties in existing datasets.
This will read the metadata in the records that match the name of the metadata property. Any other metadata present in the record not matching a metadata property will be saved but not available to use in the filtering and sorting features in the UI or SDK.
# create a record with metadata
record = rg.FeedbackRecord(
    fields={
        "prompt": "Why can camels survive long without water?",
        "response": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."
    },
    metadata={"source": "wikipedia", "response_length": 105, "my_hidden_metadata": "hidden metadata"}
)
Learn more about how to create records with metadata and how to add, modify or delete metadata from existing records.
In the Python SDK, you can filter and sort records based on the Metadata Properties that you set up for your dataset. You can combine multiple filters and sorts. Here is an example of how you could use them:
filtered_records = remote.filter_by(
    metadata_filters=[
        rg.IntegerMetadataFilter(
            name="response_length",
            ge=500,  # optional: greater than or equal to
            le=1000  # optional: less than or equal to
        ),
        rg.TermsMetadataFilter(
            name="source",
            values=["wikipedia", "wikihow"]
        )
    ]
).sort_by(
    [
        rg.SortBy(
            field="response_length",
            order="desc"  # "desc" for descending or "asc" for ascending
        )
    ]
)
In the UI, simply use the `Metadata` and `Sort` components to filter and sort records like this:
https://github.com/argilla-io/argilla/assets/126158523/6a5a7984-425d-4f1a-b0f7-7cc2bb7e4a0a
Read more about filtering and sorting in Feedback Datasets.
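As a rough mental model (plain Python, not the actual Argilla implementation), the `ge`/`le` bounds of an `IntegerMetadataFilter` select values in a closed interval, and a `TermsMetadataFilter` keeps records whose value is in the given list:

```python
# A sketch of the filter semantics above, NOT the Argilla implementation
records = [
    {"source": "wikipedia", "response_length": 105},
    {"source": "wikihow", "response_length": 750},
    {"source": "blog", "response_length": 900},
]

# IntegerMetadataFilter(name="response_length", ge=500, le=1000) roughly means
# keeping values in the closed interval [500, 1000]:
matches = [r for r in records if 500 <= r["response_length"] <= 1000]

# TermsMetadataFilter(name="source", values=["wikipedia", "wikihow"]) then keeps
# records whose value is in the given list:
matches = [r for r in matches if r["source"] in ["wikipedia", "wikihow"]]

print([r["source"] for r in matches])  # ['wikihow']
```

Combined filters behave as an AND, as in the SDK example above.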
From version 1.17.0, a new `argilla` OS user is configured in the provided Docker images. If you are using the Docker deployment and you want to upgrade to this version from a version older than v1.17.0 (if you already upgraded from v1.17.0, this step was already applied; see the v1.17.0 release notes), you should change the permissions of the SQLite DB file before upgrading. You can do it with the following command:

docker exec --user root <argilla_server_container_id> /bin/bash -c 'chmod -R 777 "$ARGILLA_HOME_PATH"'

Note: you can find the Docker container id by running:

docker ps | grep -i argilla-server
713973693fb7 argilla/argilla-server:v1.16.0 "/bin/bash start_arg…" 11 hours ago Up 7 minutes 0.0.0.0:6900->6900/tcp docker-argilla-1

Once the version is upgraded, we recommend restoring proper access permissions on this folder by setting the user and group to the new `argilla` user:

docker exec --user root <argilla_server_container_id> /bin/bash -c 'chown -R argilla:argilla "$ARGILLA_HOME_PATH"'
- `GET /api/v1/datasets/:dataset_id/metadata-properties` endpoint for listing dataset metadata properties. (#3813)
- `POST /api/v1/datasets/:dataset_id/metadata-properties` endpoint for creating dataset metadata properties. (#3813)
- `PATCH /api/v1/metadata-properties/:metadata_property_id` endpoint allowing the update of a specific metadata property. (#3952)
- `DELETE /api/v1/metadata-properties/:metadata_property_id` endpoint for deleting a specific metadata property. (#3911)
- `GET /api/v1/metadata-properties/:metadata_property_id/metrics` endpoint to compute metrics for a specific metadata property. (#3856)
- `PATCH /api/v1/records/:record_id` endpoint to update a record. (#3920)
- `PATCH /api/v1/dataset/:dataset_id/records` endpoint to bulk update the records of a dataset. (#3934)
- `PATCH /api/v1/questions/:question_id`. Now `title` and `description` use the same validations used to create questions. (#3967)
- `TermsMetadataProperty`, `IntegerMetadataProperty` and `FloatMetadataProperty` classes allowing to define metadata properties for a `FeedbackDataset`. (#3818)
- `metadata_filters` to the `filter_by` method in `RemoteFeedbackDataset` to filter based on metadata, i.e. `TermsMetadataFilter`, `IntegerMetadataFilter`, and `FloatMetadataFilter`. (#3834)
- `metadata_properties` and `metadata_filters` in their schemas and as part of the `add_records` and `filter_by` methods, respectively. (#3860)
- `sort_by` query parameter to the listing-records endpoints that allows sorting the records by `inserted_at`, `updated_at` or a metadata property. (#3843)
- `add_metadata_property` method to both `FeedbackDataset` and `RemoteFeedbackDataset` (i.e. a `FeedbackDataset` in Argilla). (#3900)
- `inserted_at` and `updated_at` in `RemoteResponseSchema`. (#3822)
- `sort_by` for `RemoteFeedbackDataset`, i.e. a `FeedbackDataset` uploaded to Argilla. (#3925)
- `metadata_properties` support for both `push_to_huggingface` and `from_huggingface`. (#3947)
- (`metadata`) from the Python SDK. (#3946)
- `delete_metadata_properties` method to delete metadata properties. (#3932)
- `update_metadata_properties` method to update `metadata_properties`. (#3961)
- `ArgillaTrainer.save` (#3857)
- `FeedbackDataset` `TaskTemplateMixin` for pre-defined task templates. (#3969)
- `last_activity_at` field to `FeedbackDataset` exposing when the last activity for the associated dataset occurred. (#3992)
- `GET /api/v1/datasets/{dataset_id}/records`, `GET /api/v1/me/datasets/{dataset_id}/records` and `POST /api/v1/me/datasets/{dataset_id}/records/search` endpoints to return the `total` number of records. (#3848, #3903)
- `__len__` method for filtered datasets to return the number of records matching the provided filters. (#3916)
- `values` to be `None`, i.e. when a record is discarded the `response.values` are set to `None`. (#3926)

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.17.0...v1.18.0
Published by frascuchon about 1 year ago
This release comes with a lot of new goodies and quality improvements. We added model card support for the ArgillaTrainer
, worked on the FeedbackDataset
task templates and added timestamps to responses. We also fixed a lot of bugs and improved the overall quality of the codebase. Enjoy!
The quickstart image startup script was changed from `/start_quickstart.sh` to `/home/argilla/start_quickstart.sh`, which might cause existing Hugging Face Spaces deployments to malfunction. A fix was added for the Argilla template Space via this PR. Alternatively, you can just create a new deployment.
From version 1.17.0, a new `argilla` OS user is configured in the provided Docker images. If you are using the Docker deployment and you want to upgrade to this version, you should take the following action after updating your container and before working with Argilla:

docker exec --user root <argilla_server_container_id> /bin/bash -c 'chown -R argilla:argilla "$ARGILLA_HOME_PATH"'

This changes the permissions on the Argilla home path, which allows it to work with the new containers.
Note: You can find the docker container id by running:
docker ps | grep -i argilla-server
713973693fb7 argilla/argilla-server:v1.17.0 "/bin/bash start_arg…" 11 hours ago Up 7 minutes 0.0.0.0:6900->6900/tcp docker-argilla-1
`ArgillaTrainer` Model Card Generation

The `ArgillaTrainer` now supports automatic model card generation. This means that you can now generate a model card with all the required info for Hugging Face and directly share these models to the Hub, as you would expect within the Hugging Face ecosystem. See the docs for more info.
model_card_kwargs = {
    "language": ["en", "es"],
    "license": "Apache-2.0",
    "model_id": "all-MiniLM-L6-v2",
    "dataset_name": "argilla/emotion",
    "tags": ["nlp", "few-shot-learning", "argilla", "setfit"],
    "model_summary": "Small summary of what the model does",
    "model_description": "An extended explanation of the model",
    "model_type": "A 1.3B parameter embedding model fine-tuned on an awesome dataset",
    "finetuned_from": "all-MiniLM-L6-v2",
    "repo": "https://github.com/...",
    "developers": "",
    "shared_by": "",
}

trainer = ArgillaTrainer(
    dataset=dataset,
    task=task,
    framework="setfit",
    framework_kwargs={"model_card_kwargs": model_card_kwargs}
)
trainer.train(output_dir="my_model")

# or get the card as `str` by calling the `generate_model_card` method
argilla_model_card = trainer.generate_model_card("my_model")
`FeedbackDataset` Task Templates

The Argilla `FeedbackDataset` now supports a number of task templates that can be used to quickly create a dataset for specific tasks out of the box. This should help new users get right into the action without having to worry about the dataset structure. We support basic tasks like text classification, but also allow you to set up complex RAG pipelines. See the docs for more info.
import argilla as rg

ds = rg.FeedbackDataset.for_text_classification(
    labels=["positive", "negative"],
    multi_label=False,
    use_markdown=True,
    guidelines=None,
)
ds
# FeedbackDataset(
#     fields=[TextField(name="text", use_markdown=True)],
#     questions=[LabelQuestion(name="label", labels=["positive", "negative"])],
#     guidelines="<Guidelines for the task>",
# )
`inserted_at` and `updated_at` are added to responses

What are responses without timestamps? The `RemoteResponseSchema` now supports `inserted_at` and `updated_at` fields. This should help you keep track of when a response was created and updated. Perfect for keeping track of annotator performance within your company.
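For instance, a minimal sketch (with made-up timestamps, not values pulled from a real dataset) of how these fields could be used to measure the time between a response being created and last updated:

```python
from datetime import datetime, timezone

# Hypothetical values as they would appear on a response's
# inserted_at / updated_at fields
inserted_at = datetime(2023, 10, 2, 9, 15, tzinfo=timezone.utc)
updated_at = datetime(2023, 10, 2, 9, 42, tzinfo=timezone.utc)

# Minutes between the response being created and last updated
elapsed_minutes = int((updated_at - inserted_at).total_seconds() // 60)
print(elapsed_minutes)  # 27
```

Aggregating such deltas per annotator would give a rough measure of annotation throughput.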
- `inserted_at` and `updated_at` in `RemoteResponseSchema` (#3822).
- `ArgillaTrainer.save` (#3857).
- `FeedbackDataset` (#3973).
- `Dockerfile` to use a multi-stage build (#3221 and #3793).
- `unify_responses` support for remote datasets (#3937).
- `TextClassificationRecord` (#3831).
- (`required=True`) when the field value was `None` (#3846).
- `pretrained_model_name_or_path` attribute as string in `ArgillaTrainer` (#3914).
- `inserted_at` and `updated_at` attributes are created using the `utcnow` factory to avoid unexpected race conditions on timestamp creation (#3945).
- `configure_dataset_settings` when providing the workspace via the `workspace` arg (#3887).
- `ArgillaTrainer` with a `peft_config` parameter (#3795).
- `from_huggingface` when loading a `FeedbackDataset` from the Hugging Face Hub that was previously dumped using another version of Argilla, starting at 1.8.0, when it was first introduced (#3829).
- `TrainingTaskForQuestionAnswering.__repr__` (#3969).
- `TrainingTask.prepare_for_training_with_*` methods (#3969).
- `rg.configure_dataset` is deprecated in favour of `rg.configure_dataset_settings`. The former will be removed in version 1.19.0.

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.16.0...v1.17.0
Published by gabrielmbmb about 1 year ago
This release comes with an auto save feature for the UI, an enhanced Argilla CLI app, new keyboard shortcuts for the annotation process in the Feedback Dataset and new integrations for the `ArgillaTrainer`.
Have you ever written a long corrected text in a `TextField` for a completion given by an LLM, only to refresh the page before submitting it? Well, since this release you are covered! The Argilla UI now saves the responses given in the annotation form of a `FeedbackDataset` every few seconds. Annotators can partially annotate one record and then come back to finish the annotation process without losing their previous work.
The Argilla CLI has been updated to include an extensive list of new commands, from users and datasets management to training models all from the terminal!
Now, you can seamlessly navigate within the feedback form using just your keyboard. We've extended these shortcuts to cover all available question types: Label, Multi-label, Ranking, Rating and Text.
`ArgillaTrainer`

The `ArgillaTrainer` doesn't stop getting new features and improvements!

- A `TrainingTask` has been added for Question Answering (QnA)
- `FeedbackDataset` support for fine-tuning an OpenAI model for Chat Completion
- `ArgillaTrainer` integration with sentence-transformers, allowing fine-tuning for sentence similarity (#3739)
- `ArgillaTrainer` integration with `TrainingTask.for_question_answering` (#3740)
- `Auto save record` to automatically save the current record that you are working on (#3541)
- `ArgillaTrainer` integration with OpenAI, allowing fine-tuning for chat completion (#3615)
- `workspaces list` command to list Argilla workspaces (#3594).
- `datasets list` command to list Argilla datasets (#3658).
- `users create` command to create users (#3667).
- `whoami` command to get the current user (#3673).
- `users delete` command to delete users (#3671).
- `users list` command to list users (#3688).
- `workspaces delete-user` command to remove a user from a workspace (#3699).
- `datasets list` command to list Argilla datasets (#3658).
- `users create` command to create users (#3667).
- `users delete` command to delete users (#3671).
- `workspaces create` command to create an Argilla workspace (#3676).
- `datasets push-to-hub` command to push a `FeedbackDataset` from Argilla into the Hugging Face Hub (#3685).
- `info` command to get info about the used Argilla client and server (#3707).
- `datasets delete` command to delete a `FeedbackDataset` from Argilla (#3703).
- `created_at` and `updated_at` properties to `RemoteFeedbackDataset` and `FilteredRemoteFeedbackDataset` (#3709).
- `PermissionError` when executing a command with a logged-in user without enough permissions (#3717).
- `workspaces add-user` command to add a user to a workspace (#3712).
- `workspace_id` param to the `GET /api/v1/me/datasets` endpoint (#3727).
- `workspace_id` arg to `list_datasets` in the Python SDK (#3727).
- `argilla` script that allows executing the Argilla CLI using the `argilla` command (#3730).
- `server_info` function to check the Argilla server information (also accessible via `rg.server_info`) (#3772).
- `database` commands under the `server` group of commands (#3710).
- `server` commands only included in the CLI app when the `server` extra requirements are installed (#3710).
- `PUT /api/v1/responses/{response_id}` to replace the `values` stored with the `values` received in the request (#3711).
- `UserWarning` when the `user_id` in `Workspace.add_user` and `Workspace.delete_user` is the ID of a user with the owner role, as they don't require explicit permissions (#3716).
- `tasks` sub-package to `cli` (#3723).
- `argilla database` command in the CLI can now be accessed via `argilla server database`, and will be deprecated in the upcoming release (#3754).
- `visible_options` (of label and multi-label selection questions) validation in the backend to check that the provided value is greater than or equal to 3 and less than or equal to the number of provided options (#3773).
- `remove user modification in text component on clear answers` (#3775)
- `Highlight raw text field in dataset feedback task` (#3731)
- `Field title too long` (#3734)
- `DatasetForTextClassification` (#3652)
- `Pending queue` pagination problems during data annotation (#3677)
- `visible_labels` default value to be 20 only when `visible_labels` is not provided and `len(labels) > 20`; otherwise it will be either the provided `visible_labels` value or `None`, for `LabelQuestion` and `MultiLabelQuestion` (#3702).
- `DatasetCard` generation when `RemoteFeedbackDataset` contains suggestions (#3718).
- `draft` status in `ResponseSchema`, as now there can be responses with `draft` status when annotating via the UI (#3749).
- `/api/datasets` endpoints due to the `TaskType` enum replacement in the endpoint URL (#3769).

Full Changelog: https://github.com/argilla-io/argilla/compare/v1.15.1...v1.16.0
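The `visible_labels` default fixed in #3702 can be sketched as a small hypothetical helper (not the actual Argilla code):

```python
def default_visible_labels(labels, visible_labels=None):
    """Hypothetical helper mirroring the documented default: 20 only when
    visible_labels is not provided and there are more than 20 labels;
    otherwise the provided value (or None)."""
    if visible_labels is None and len(labels) > 20:
        return 20
    return visible_labels

print(default_visible_labels([f"label-{i}" for i in range(30)]))  # 20
print(default_visible_labels(["positive", "negative"]))  # None
```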