Add, update, and delete records¶
This guide provides an overview of records, explaining the basics of how to define and manage them in Argilla.
A record in Argilla is a data item that requires annotation, consisting of one or more fields. These are the pieces of information displayed to the user in the UI to facilitate the completion of the annotation task. Each record also includes questions that annotators are required to answer, with the option of adding suggestions and responses to assist them. Guidelines are also provided to help annotators effectively complete their tasks.
A record is part of a dataset, so you will need to create a dataset before adding records. Check this guide to learn how to create a dataset.
Main Class
rg.Record(
external_id="1234",
fields={
"question": "Do you need oxygen to breathe?",
"answer": "Yes"
},
metadata={
"category": "A"
},
vectors={
"my_vector": [0.1, 0.2, 0.3],
},
suggestions=[
rg.Suggestion("my_label", "positive", score=0.9, agent="model_name")
],
responses=[
rg.Response("label", "positive", user_id=user_id)
],
)
Check the Record - Python Reference to see the attributes, arguments, and methods of the
Record
class in detail.
Add records¶
You can add records to a dataset in two different ways: either by using a dictionary or by directly initializing a Record
object. You should ensure that fields, metadata and vectors match those configured in the dataset settings. In both cases, are added via the Dataset.records.log
method. As soon as you add the records, these will be available in the Argilla UI. If they do not appear in the UI, you may need to click the refresh button to update the view.
Tip
Take some time to inspect the data before adding it to the dataset in case this triggers changes in the questions
or fields
.
Note
If you are planning to use public data, the Datasets page of the Hugging Face Hub is a good place to start. Remember to always check the license to make sure you can legally use it for your specific use case.
You can add records to a dataset by initializing a Record
object directly. This is ideal if you need to apply logic to the data before defining the record. If the data is already structured, you should consider adding it directly as a dictionary or Hugging Face dataset.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets(name="my_dataset")
records = [
rg.Record(
fields={
"question": "Do you need oxygen to breathe?",
"answer": "Yes"
},
),
rg.Record(
fields={
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius"
},
), # (1)
]
dataset.records.log(records)
- This is an illustration of a definition. In a real-world scenario, you would iterate over a data structure and create
Record
objects for each iteration.
You can add the data directly as a dictionary like structure, where the keys correspond to the names of fields, questions, metadata or vectors in the dataset and the values are the data to be added.
If your data structure does not correspond to your Argilla dataset names, you can use a mapping
to indicate which keys in the source data correspond to the dataset fields, metadata, vectors, suggestions, or responses. If you need to add the same data to multiple attributes, you can also use a list with the name of the attributes.
We illustrate this python dictionaries that represent your data, but we would not advise you to define dictionaries. Instead, use the Record
object to instantiate records.
import argilla as rg
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets(name="my_dataset")
# Add records to the dataset with the fields 'question' and 'answer'
data = [
{
"question": "Do you need oxygen to breathe?",
"answer": "Yes",
},
{
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius",
}, # (1)
]
dataset.records.log(data)
# Add records to the dataset with a mapping of the fields 'question' and 'answer'
data = [
{
"query": "Do you need oxygen to breathe?",
"response": "Yes",
},
{
"query": "What is the boiling point of water?",
"response": "100 degrees Celsius",
},
]
dataset.records.log(data, mapping={"query": "question", "response": "answer"}) # (2)
- The data structure's keys must match the fields or questions in the Argilla dataset. In this case, there are fields named
question
andanswer
. - The data structure has keys
query
andresponse
, and the Argilla dataset has fieldsquestion
andanswer
. You can use themapping
parameter to map the keys in the data structure to the fields in the Argilla dataset.
You can also add records to a dataset using a Hugging Face dataset. This is useful when you want to use a dataset from the Hugging Face Hub and add it to your Argilla dataset.
You can add the dataset where the column names correspond to the names of fields, metadata or vectors in the Argilla dataset.
import argilla as rg
from datasets import load_dataset
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
dataset = client.datasets(name="my_dataset") # (1)
hf_dataset = load_dataset("imdb", split="train[:100]") # (2)
dataset.records.log(records=hf_dataset)
-
In this case, we are using the
my_dataset
dataset from the Argilla workspace. The dataset has atext
field and alabel
question. -
In this example, the Hugging Face dataset matches the Argilla dataset schema. If that is not the case, you could use the
.map
of thedatasets
library to prepare the data before adding it to the Argilla dataset.
If the Hugging Face dataset's schema does not correspond to your Argilla dataset field names, you can use a mapping
to specify the relationship. You should indicate as key the column name of the Hugging Face dataset and, as value, the field name of the Argilla dataset.
- In this case, the
text
key in the Hugging Face dataset would correspond to thereview
field in the Argilla dataset, and thelabel
key in the Hugging Face dataset would correspond to thesentiment
field in the Argilla dataset.
Fields¶
Fields are the main pieces of information of the record. These are shown at first sight in the UI together with the questions form. You may only include fields that you have previously configured in the dataset settings. Depending on the type of fields included in the dataset, the data format may be slightly different:
Text fields expect input in the form of a string
.
Image fields expect a remote URL or local path to an image file in the form of a string
, or a PIL object.
Check the Dataset.records - Python Reference to see how to add records with with images in detail.
Chat fields expect a list of dictionaries with the keys role
and content
, where the role
identifies the interlocutor type (e.g., user, assistant, model, etc.), whereas the content
contains the text of the message.
Metadata¶
Record metadata can include any information about the record that is not part of the fields in the form of a dictionary. To use metadata for filtering and sorting records, make sure that the key of the dictionary corresponds with the metadata property name
. When the key doesn't correspond, this will be considered extra metadata that will get stored with the record (as long as allow_extra_metadata
is set to True
for the dataset), but will not be usable for filtering and sorting.
Note
Remember that to use metadata within a dataset, you must define a metadata property in the dataset settings.
Check the Metadata - Python Reference to see the attributes, arguments, and methods for using metadata in detail.
You can add metadata to a record in an initialized Record
object.
# Add records to the dataset with the metadata 'category'
records = [
rg.Record(
fields={
"question": "Do you need oxygen to breathe?",
"answer": "Yes"
},
metadata={"my_metadata": "option_1"},
),
rg.Record(
fields={
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius"
},
metadata={"my_metadata": "option_1"},
),
]
dataset.records.log(records)
You can add metadata to a record directly as a dictionary structure, where the keys correspond to the names of metadata properties in the dataset and the values are the metadata to be added. Remember that you can also use the mapping
parameter to specify the data structure.
# Add records to the dataset with the metadata 'category'
data = [
{
"question": "Do you need oxygen to breathe?",
"answer": "Yes",
"my_metadata": "option_1",
},
{
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius",
"my_metadata": "option_1",
},
]
dataset.records.log(data)
Vectors¶
You can associate vectors, like text embeddings, to your records. They can be used for semantic search in the UI and the Python SDK. Make sure that the length of the list corresponds to the dimensions set in the vector settings.
Note
Remember that to use vectors within a dataset, you must define them in the dataset settings.
Check the Vector - Python Reference to see the attributes, arguments, and methods of the
Vector
class in detail.
You can also add vectors to a record in an initialized Record
object.
# Add records to the dataset with the vector 'my_vector' and dimension=3
records = [
rg.Record(
fields={
"question": "Do you need oxygen to breathe?",
"answer": "Yes"
},
vectors={
"my_vector": [0.1, 0.2, 0.3]
},
),
rg.Record(
fields={
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius"
},
vectors={
"my_vector": [0.2, 0.5, 0.3]
},
),
]
dataset.records.log(records)
You can add vectors from a dictionary-like structure, where the keys correspond to the name
s of the vector settings that were configured for your dataset and the value is a list of floats. Remember that you can also use the mapping
parameter to specify the data structure.
# Add records to the dataset with the vector 'my_vector' and dimension=3
data = [
{
"question": "Do you need oxygen to breathe?",
"answer": "Yes",
"my_vector": [0.1, 0.2, 0.3],
},
{
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius",
"my_vector": [0.2, 0.5, 0.3],
},
]
dataset.records.log(data)
Suggestions¶
Suggestions refer to suggested responses (e.g. model predictions) that you can add to your records to make the annotation process faster. These can be added during the creation of the record or at a later stage. Only one suggestion can be provided for each question, and suggestion values must be compliant with the pre-defined questions e.g. if we have a RatingQuestion
between 1 and 5, the suggestion should have a valid value within that range.
Check the Suggestions - Python Reference to see the attributes, arguments, and methods of the
Suggestion
class in detail.
Tip
Check the Suggestions - Python Reference for different formats per Question
type.
You can also add suggestions to a record in an initialized Record
object.
# Add records to the dataset with the label 'my_label'
records = [
rg.Record(
fields={
"question": "Do you need oxygen to breathe?",
"answer": "Yes"
},
suggestions=[
rg.Suggestion(
"my_label",
"positive",
score=0.9,
agent="model_name"
)
],
),
rg.Record(
fields={
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius"
},
suggestions=[
rg.Suggestion(
"my_label",
"negative",
score=0.9,
agent="model_name"
)
],
),
]
dataset.records.log(records)
You can add suggestions as a dictionary, where the keys correspond to the name
s of the labels that were configured for your dataset. Remember that you can also use the mapping
parameter to specify the data structure.
# Add records to the dataset with the label question 'my_label'
data = [
{
"question": "Do you need oxygen to breathe?",
"answer": "Yes",
"label": "positive",
"score": 0.9,
"agent": "model_name",
},
{
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius",
"label": "negative",
"score": 0.9,
"agent": "model_name",
},
]
dataset.records.log(
data=data,
mapping={
"label": "my_label",
"score": "my_label.suggestion.score",
"agent": "my_label.suggestion.agent",
},
)
Responses¶
If your dataset includes some annotations, you can add those to the records as you create them. Make sure that the responses adhere to the same format as Argilla's output and meet the schema requirements for the specific type of question being answered. Make sure to include the user_id
in case you're planning to add more than one response for the same question, if not responses will apply to all the annotators.
Check the Responses - Python Reference to see the attributes, arguments, and methods of the
Response
class in detail.
Note
Keep in mind that records with responses will be displayed as "Draft" in the UI.
Tip
Check the Responses - Python Reference for different formats per Question
type.
You can also add suggestions to a record in an initialized Record
object.
# Add records to the dataset with the label 'my_label'
records = [
rg.Record(
fields={
"question": "Do you need oxygen to breathe?",
"answer": "Yes"
},
responses=[
rg.Response("my_label", "positive", user_id=user.id)
]
),
rg.Record(
fields={
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius"
},
responses=[
rg.Response("my_label", "negative", user_id=user.id)
]
),
]
dataset.records.log(records)
You can add suggestions as a dictionary, where the keys correspond to the name
s of the labels that were configured for your dataset. Remember that you can also use the mapping
parameter to specify the data structure. If you want to specify the user that added the response, you can use the user_id
parameter.
# Add records to the dataset with the label 'my_label'
data = [
{
"question": "Do you need oxygen to breathe?",
"answer": "Yes",
"label": "positive",
},
{
"question": "What is the boiling point of water?",
"answer": "100 degrees Celsius",
"label": "negative",
},
]
dataset.records.log(data, user_id=user.id, mapping={"label": "my_label.response"})
List records¶
To list records in a dataset, you can use the records
method on the Dataset
object. This method returns a list of Record
objects that can be iterated over to access the record properties.
for record in dataset.records(
with_suggestions=True,
with_responses=True,
with_vectors=True
):
# Access the record properties
print(record.metadata)
print(record.vectors)
print(record.suggestions)
print(record.responses)
# Access the responses of the record
for response in record.responses:
print(response.value)
Update records¶
You can update records in a dataset by calling the log
method on the Dataset
object. To update a record, you need to provide the record id
and the new data to be updated.
data = dataset.records.to_list(flatten=True)
updated_data = [
{
"text": sample["text"],
"label": "positive",
"id": sample["id"],
}
for sample in data
]
dataset.records.log(records=updated_data)
The metadata
of the Record
object is a python dictionary. To update it, you can iterate over the records and update the metadata by key. After that, you should update the records in the dataset.
Tip
Check the Metadata - Python Reference for different formats per MetadataProperty
type.
If a new vector field is added to the dataset settings or some value for the existing record vectors must be updated, you can iterate over the records and update the vectors by key. After that, you should update the records in the dataset.
If some value for the existing record suggestions must be updated, you can iterate over the records and update the suggestions by key. You can also add a suggestion using the add
method. After that, you should update the records in the dataset.
Tip
Check the Suggestions - Python Reference for different formats per Question
type.
updated_records = []
for record in dataset.records(with_suggestions=True):
# We can update existing suggestions
record.suggestions["label"].value = "new_value"
record.suggestions["label"].score = 0.9
record.suggestions["label"].agent = "model_name"
# We can also add new suggestions with the `add` method:
if not record.suggestions["label"]:
record.suggestions.add(
rg.Suggestion("value", "label", score=0.9, agent="model_name")
)
updated_records.append(record)
dataset.records.log(records=updated_records)
If some value for the existing record responses must be updated, you can iterate over the records and update the responses by key. You can also add a response using the add
method. After that, you should update the records in the dataset.
Tip
Check the Responses - Python Reference for different formats per Question
type.
updated_records = []
for record in dataset.records(with_responses=True):
for response in record.responses["label"]:
if response:
response.value = "new_value"
response.user_id = "existing_user_id"
else:
record.responses.add(rg.Response("label", "YES", user_id=user.id))
updated_records.append(record)
dataset.records.log(records=updated_records)
Delete records¶
You can delete records in a dataset calling the delete
method on the Dataset
object. To delete records, you need to retrieve them from the server and get a list with those that you want to delete.
Delete records based on a query
It can be very useful to avoid eliminating records with responses.
For more information about the query syntax, check this how-to guide.