Dataset management¶
This guide provides an overview of datasets, explaining the basics of how to set them up and manage them in Argilla.
A dataset is a collection of records that you can configure for labelers to provide feedback using the UI. Depending on the specific requirements of your task, you may need various types of feedback. You can customize the dataset to include different kinds of questions, so the first step will be to define the aim of your project and the kind of data and feedback you will need. With this information, you can start configuring a dataset by defining fields, questions, metadata, vectors, and guidelines through settings.
Question: Who can manage datasets?
Only users with the owner role can manage (create, retrieve, update and delete) all the datasets.
The users with the admin role can manage (create, retrieve, update and delete) the datasets in the workspaces they have access to.
Main Classes
Check the Dataset - Python Reference to see the attributes, arguments, and methods of the Dataset class in detail.
rg.Settings(
    fields=[rg.TextField(name="text")],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        )
    ],
    metadata=[rg.TermsMetadataProperty(name="metadata")],
    vectors=[rg.VectorField(name="vector", dimensions=10)],
    guidelines="guidelines",
    allow_extra_metadata=True,
    distribution=rg.TaskDistribution(min_submitted=2),
)
Check the Settings - Python Reference to see the attributes, arguments, and methods of the Settings class in detail.
Create a dataset¶
To create a dataset, define it with the Dataset class and then call the create method, which sends the dataset to the server so that it can be visualized in the UI. If the dataset does not appear in the UI, you may need to click the refresh button to update the view. For further configuration of the dataset, you can refer to the settings section.
Info
If you have deployed Argilla with Hugging Face Spaces and HF Sign in, you can use argilla as a workspace name. Otherwise, you might need to create a workspace following this guide.
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[
        rg.TextField(
            name="text",
        ),
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            labels=["label_1", "label_2", "label_3"]
        ),
    ],
)

dataset = rg.Dataset(
    name="my_dataset",
    workspace="my_workspace",
    settings=settings,
)

dataset.create()
The created dataset will be empty. To add records, go to this how-to guide.
Accessing attributes
Access the attributes of a dataset by calling them directly on the dataset object. For example, dataset.id, dataset.name or dataset.settings. You can similarly access the fields, questions, metadata, vectors and guidelines. For instance, dataset.fields or dataset.questions.
Create multiple datasets with the same settings¶
To create multiple datasets with the same settings, define the settings once and pass them to each dataset.
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

settings = rg.Settings(
    guidelines="These are some guidelines.",
    fields=[rg.TextField(name="text", use_markdown=True)],
    questions=[
        rg.LabelQuestion(name="label", labels=["label_1", "label_2", "label_3"])
    ],
    distribution=rg.TaskDistribution(min_submitted=3),
)

dataset1 = rg.Dataset(name="my_dataset_1", settings=settings)
dataset2 = rg.Dataset(name="my_dataset_2", settings=settings)

# Create the datasets on the server
dataset1.create()
dataset2.create()
Create a dataset from an existing dataset¶
To create a new dataset from an existing dataset, get the settings from the existing dataset and pass them to the new dataset.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
existing_dataset = client.datasets("my_dataset")
new_dataset = rg.Dataset(name="my_dataset_copy", settings=existing_dataset.settings)
new_dataset.create()
Info
You can also copy the records from the original dataset to the new one:
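A minimal sketch of such a copy follows. It assumes that iterating over dataset.records yields the stored records and that dataset.records.log accepts a list of records; copy_records is a hypothetical helper name, not part of the SDK.

```python
def copy_records(source_dataset, target_dataset):
    """Copy every record from source_dataset into target_dataset.

    Hypothetical helper. Assumes `dataset.records` is iterable and that
    `dataset.records.log(...)` accepts a list of records.
    """
    # Materialize the records into a list before logging them.
    records = list(source_dataset.records)
    target_dataset.records.log(records)
    return len(records)
```

With the objects from the example above, this would be called as copy_records(existing_dataset, new_dataset) once new_dataset.create() has run.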
Define dataset settings¶
Tip
Instead of defining your own custom settings, you can use some of our pre-built templates for text classification, ranking and rating. Learn more here.
Fields¶
The fields in a dataset consist of one or more data items requiring annotation. Currently, Argilla supports plain text and markdown through the TextField, images through the ImageField, chat-formatted data through the ChatField and full custom templates through our CustomField.
Note
The order of the fields in the UI follows the order in which these are added to the fields attribute in the Python SDK.
Check the Field - Python Reference to see the field classes in detail.
A CustomField allows you to use a custom template for the field. This is useful if you want to use a custom UI for the field. You can use the template argument to pass a string that will be rendered as the field's UI.
By default, advanced_mode=False, which will use a brackets syntax engine for the templates. This engine replaces placeholders such as {{record.fields.field.key}} with the corresponding values from the record's field object. You can also use advanced_mode=True, which deactivates the brackets syntax engine and allows you to add custom JavaScript to your template to render the field.
rg.CustomField(
    name="custom",
    title="Custom",
    template="<div>{{record.fields.custom.key}}</div>",
    advanced_mode=False,
    required=True,
    description="Field description",
)
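To make the brackets syntax concrete, here is a small Python sketch of the substitution it describes. It is an illustration only, not Argilla's template engine: render_brackets is a hypothetical helper, and the record is modeled as a plain dictionary whose custom field holds an object with a key entry.

```python
import re


def render_brackets(template, record):
    """Replace {{record.a.b.c}} placeholders with values looked up in `record`.

    Illustrative only: this mimics the idea of the brackets syntax and is
    not the engine Argilla uses to render custom fields.
    """
    def lookup(match):
        # Walk the dotted path, starting from a root that exposes "record".
        value = {"record": record}
        for part in match.group(1).split("."):
            value = value[part]
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", lookup, template)


# The field named "custom" holds an object with a "key" entry.
record = {"fields": {"custom": {"key": "Hello"}}}
html = render_brackets("<div>{{record.fields.custom.key}}</div>", record)
print(html)  # <div>Hello</div>
```

The point of the sketch is the shape of the data: with the template above, the value logged for the custom field is expected to be an object that contains the key referenced in the placeholder.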
Tip
To learn more about how to create custom fields with HTML and CSS templates, check this how-to guide.
Questions¶
To collect feedback for your dataset, you need to formulate questions that annotators will be asked to answer.
Check the Questions - Python Reference to see the question classes in detail.
A LabelQuestion asks annotators to choose a unique label from a list of options. This type is useful for text classification tasks. In the UI, the labels will have a rounded shape.
A MultiLabelQuestion asks annotators to choose all applicable labels from a list of options. This type is useful for multi-label text classification tasks. In the UI, the labels will have a squared shape.
rg.MultiLabelQuestion(
    name="multi_label",
    labels={
        "hate": "Hate Speech",
        "sexual": "Sexual content",
        "violent": "Violent content",
        "pii": "Personal information",
        "untruthful": "Untruthful info",
        "not_english": "Not English",
        "inappropriate": "Inappropriate content"
    }, # or ["hate", "sexual", "violent", "pii", "untruthful", "not_english", "inappropriate"]
    title="Does the response include any of the following?",
    description="Select all that apply.",
    required=True,
    visible_labels=10,
    labels_order="natural"
)
A RankingQuestion asks annotators to order a list of options. It is useful to gather information on the preference or relevance of a set of options.
A RatingQuestion asks annotators to select one option from a list of integer values. This type is useful for collecting numerical scores.
A SpanQuestion asks annotators to select a portion of the text of a specific field and apply a label to it. This type of question is useful for named entity recognition or information extraction tasks.
rg.SpanQuestion(
    name="span",
    field="text",
    labels={
        "PERSON": "Person",
        "ORG": "Organization",
        "LOC": "Location",
        "MISC": "Miscellaneous"
    }, # or ["PERSON", "ORG", "LOC", "MISC"]
    title="Select the entities in the text",
    description="Select the entities in the text",
    required=True,
    allow_overlapping=False,
    visible_labels=10
)
A TextQuestion offers annotators a free-text area where they can enter any text. This type is useful for collecting natural language data, such as corrections or explanations.
Metadata¶
Metadata properties allow you to configure the use of metadata information for the filtering and sorting features available in the UI and Python SDK.
Check the Metadata - Python Reference to see the metadata classes in detail.
A TermsMetadataProperty allows you to add a list of strings as metadata options.
An IntegerMetadataProperty allows you to add integer values as metadata.
Note
You can also set the allow_extra_metadata argument in the dataset to True to allow records to carry metadata fields other than those specified under metadata. Note that these extra fields will not be accessible from the UI for any user; they are only retrievable using the Python SDK.
Vectors¶
To use the similarity search in the UI and the Python SDK, you will need to configure vectors using the VectorField class.
Check the Vector - Python Reference to see the VectorField class in detail.
Guidelines¶
Once you have decided on the data to show and the questions to ask, it's important to provide clear guidelines to the annotators. These guidelines help them understand the task and answer the questions consistently. You can provide guidelines in two ways:
- In the dataset guidelines: this is added as an argument when you create your dataset in the Python SDK. They will appear in the annotation interface.
guidelines = "In this dataset, you will find a collection of records that show a category, an instruction, a context and a response to that instruction. [...]"
- As question descriptions: these are added as an argument when you create questions in the Python SDK. This text will appear in a tooltip next to the question in the UI.
It is good practice to use at least the dataset guidelines if not both methods. Question descriptions should be short and provide context to a specific question. They can be a summary of the guidelines to that question, but often that is not sufficient to align the whole annotation team. In the guidelines, you can include a description of the project, details on how to answer each question with examples, instructions on when to discard a record, etc.
Tip
If you want further guidance on good practices for guidelines during the project development, check our blog post.
Distribution¶
When working as a team, you may want to distribute the annotation task to ensure efficiency and quality. You can use the TaskDistribution settings to configure the minimum number of submitted responses expected for each record. Argilla will use this setting to automatically handle records in your team members' pending queues.
Check the Task Distribution - Python Reference to see the TaskDistribution class in detail.
To learn more about how to distribute the task among team members, see the Distribute the annotation guide.
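As a conceptual sketch of what min_submitted implies, the snippet below models the rule in plain Python. It rests on an assumption about the behavior (a record stops being pending once it has collected at least min_submitted submitted responses) and is not Argilla's implementation; pending_records and the record ids are hypothetical.

```python
def pending_records(submitted_counts, min_submitted):
    """Return the ids of records that still need more submitted responses.

    `submitted_counts` maps a record id to the number of responses
    submitted for it so far (hypothetical data, for illustration).
    """
    return [
        record_id
        for record_id, count in submitted_counts.items()
        if count < min_submitted
    ]


counts = {"rec_1": 0, "rec_2": 1, "rec_3": 2}
still_pending = pending_records(counts, min_submitted=2)
print(still_pending)  # ['rec_1', 'rec_2']
```

Under this reading, raising min_submitted keeps records in the queues longer and yields more overlapping responses per record, while min_submitted=1 spreads the records across the team with no overlap.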
List datasets¶
You can list all the datasets available in a workspace using the datasets attribute of the Workspace class. You can also use len(workspace.datasets) to get the number of datasets in a workspace.
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

workspace = client.workspaces("my_workspace")
datasets = workspace.datasets

for dataset in datasets:
    print(dataset)
When you list datasets, dataset settings are not preloaded, since this can introduce extra requests to the server. If you want to work with settings when listing datasets, you need to load them:
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

for dataset in client.datasets:
    dataset.settings.get() # this will get the dataset settings from the server
    print(dataset.settings)
Notebooks
When using a notebook, executing client.datasets will display a table with the name of the existing datasets, the id, workspace_id to which they belong, and the last update as updated_at.
Retrieve a dataset¶
You can retrieve a dataset by calling the datasets method on the Argilla class and passing the name or id of the dataset as an argument. If the dataset does not exist, a warning message will be raised and None will be returned.
By default, this method attempts to retrieve the dataset from the first workspace. If the dataset is in a different workspace, you must specify either the workspace or workspace name as an argument.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
# Retrieve the dataset from the first workspace
retrieved_dataset = client.datasets(name="my_dataset")
# Retrieve the dataset from the specified workspace
retrieved_dataset = client.datasets(name="my_dataset", workspace="my_workspace")
Check dataset existence¶
You can check if a dataset exists. The client.datasets method will return None if the dataset was not found.
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

dataset = client.datasets(name="my_dataset")

if dataset is not None:
    pass
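Building on this check, a common pattern is to create the dataset only when it is missing. The following is a minimal sketch, assuming client.datasets(name=...) returns None for a missing dataset as described above; get_or_create_dataset and build_dataset are hypothetical names.

```python
def get_or_create_dataset(client, name, build_dataset):
    """Return the dataset called `name`, creating it if it does not exist.

    `build_dataset` is a zero-argument callable that returns a dataset
    that has not been created yet, for example:
        lambda: rg.Dataset(name="my_dataset", settings=settings)
    """
    dataset = client.datasets(name=name)
    if dataset is None:
        dataset = build_dataset()
        dataset.create()  # send the new dataset to the server
    return dataset
```

Passing a callable instead of a ready-made dataset means the settings are only built when the dataset really has to be created.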
Update a dataset¶
Once a dataset is published, there are limited things you can update. Here is a summary of the attributes you can change for each setting:
Fields
Attributes | From SDK | From UI |
---|---|---|
Name | ❌ | ❌ |
Title | ✅ | ✅ |
Required | ❌ | ❌ |
Use markdown | ✅ | ✅ |
Template | ✅ | ❌ |
Questions
Attributes | From SDK | From UI |
---|---|---|
Name | ❌ | ❌ |
Title | ❌ | ✅ |
Description | ❌ | ✅ |
Required | ❌ | ❌ |
Labels | ❌ | ❌ |
Values | ❌ | ❌ |
Label order | ❌ | ✅ |
Suggestions first | ❌ | ✅ |
Visible labels | ❌ | ✅ |
Field | ❌ | ❌ |
Allow overlapping | ❌ | ❌ |
Use markdown | ❌ | ✅ |
Metadata
Attributes | From SDK | From UI |
---|---|---|
Name | ❌ | ❌ |
Title | ✅ | ✅ |
Options | ❌ | ❌ |
Minimum value | ❌ | ❌ |
Maximum value | ❌ | ❌ |
Visible for annotators | ✅ | ✅ |
Allow extra metadata | ✅ | ✅ |
Vectors
Attributes | From SDK | From UI |
---|---|---|
Name | ❌ | ❌ |
Title | ✅ | ✅ |
Dimensions | ❌ | ❌ |
Guidelines
From SDK | From UI |
---|---|
✅ | ✅ |
Distribution
Attributes | From SDK | From UI |
---|---|---|
Minimum submitted | ✅ | ✅ |
To modify these attributes, you can simply set the new value of the attributes you wish to change and call the update method on the Dataset object.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets("my_dataset")
dataset.settings.fields["text"].use_markdown = True
dataset.settings.metadata["my_metadata"].visible_for_annotators = False
dataset.update()
You can also add and delete metadata properties and vector fields using the add and delete methods.
import argilla as rg

client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")

dataset = client.datasets("my_dataset")

dataset.settings.vectors.add(rg.VectorField(name="my_new_vector", dimensions=123))
dataset.settings.metadata.add(
    rg.TermsMetadataProperty(
        name="my_new_metadata",
        options=["option_1", "option_2", "option_3"],
    ),
)

dataset.update()
Delete a dataset¶
You can delete an existing dataset by calling the delete method on the Dataset class.
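A minimal sketch of a guarded delete follows. It relies only on the calls described in this guide (client.datasets(name=...) returning None for a missing dataset, and the delete method on the retrieved dataset); delete_dataset is a hypothetical helper name.

```python
def delete_dataset(client, name):
    """Delete the dataset called `name`.

    Returns True if a dataset was found and deleted, False if no dataset
    with that name exists.
    """
    dataset = client.datasets(name=name)
    if dataset is None:
        return False
    dataset.delete()
    return True
```

With a connected client, such as client = rg.Argilla(api_url="<api_url>", api_key="<api_key>"), this would be called as delete_dataset(client, "my_dataset").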