🧑‍💻 Create a dataset#

Feedback Dataset#

Warning

The dataset class covered in this section is the FeedbackDataset. This fully configurable dataset will replace the DatasetForTextClassification, DatasetForTokenClassification, and DatasetForText2Text in Argilla 2.0. Not sure which dataset to use? Check out our section on choosing a dataset.

The Feedback Task datasets allow you to combine multiple questions of different kinds, so the first step will be to define the aim of your project and the kind of data and feedback you will need to get there. With this information, you can start configuring a dataset and formatting records using the Python SDK.

This guide will walk you through all the elements you will need to configure to create a FeedbackDataset and add records to it.

Note

To follow the steps in this guide, you will first need to connect to Argilla. Check how to do so in our cheatsheet.

Add records#

At this point, we just need to add records to our FeedbackDataset. Take some time to explore and find data that fits the purpose of your project. If you are planning to use public data, the Datasets page of the Hugging Face Hub is a good place to start.

Tip

If you are using a public dataset, remember to always check the license to make sure you can legally employ it for your specific use case.

from datasets import load_dataset

# Load and inspect a dataset from the Hugging Face Hub
hf_dataset = load_dataset('databricks/databricks-dolly-15k', split='train')
df = hf_dataset.to_pandas()
df

Hint

Take some time to inspect the data before adding it to the dataset in case this triggers changes in the questions or fields.

The next step is to create records following Argilla’s FeedbackRecord format. These are the attributes of a FeedbackRecord:

fields: A dictionary with the name (key) and content (value) of each of the fields in the record. These will need to match the fields set up in the dataset configuration (see Define record fields).
external_id (optional): An ID of the record defined by the user. If there is no external ID, this will be None.
metadata (optional): A dictionary with the metadata of the record. This can include any information about the record that is not part of the fields. If you want the metadata to correspond with the metadata properties configured for your dataset, make sure that the key of the dictionary corresponds with the metadata property name. When the key doesn’t correspond, this will be considered extra metadata that will get stored with the record, but will not be usable for filtering and sorting. If there is no metadata, this will be None.
suggestions (optional): A list of all suggested responses for a record, e.g., model predictions or other helpful hints for the annotators. Only one suggestion can be provided for each question, and suggestion values must be compliant with the pre-defined questions. For example, if we have a RatingQuestion between 1 and 5, the suggestion should have a valid value within that range. If suggestions are added, they will appear in the UI as pre-filled responses.
responses (optional): A list of all responses to a record. You will only need to add them if your dataset already has some annotated records. Make sure that the responses adhere to the same format as Argilla’s output and meet the schema requirements for the specific type of question being answered. If you plan to add more than one response for the same question, also include a user_id in each response: only one response can have user_id set to None, which will later be replaced by the active user_id, and any other responses without a user_id will be discarded.

import argilla as rg

# Create a single Feedback Record
record = rg.FeedbackRecord(
    fields={
        "question": "Why can camels survive long without water?",
        "answer": "Camels use the fat in their humps to keep them filled with energy and hydration for long periods of time."
    },
    metadata={"source": "encyclopedia"},
    external_id=None
)
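
The difference between metadata that matches a configured metadata property and extra metadata can be checked before building records. The following is a minimal plain-Python sketch and not part of the Argilla SDK: the `split_metadata` helper and the `configured_properties` names are hypothetical and only illustrate the rule described above.

```python
def split_metadata(metadata, configured_properties):
    """Separate metadata keys that match configured property names from extra keys."""
    matched = {k: v for k, v in metadata.items() if k in configured_properties}
    extra = {k: v for k, v in metadata.items() if k not in configured_properties}
    return matched, extra


# Hypothetical property names; use the ones configured for your own dataset.
configured_properties = {"source", "length"}
metadata = {"source": "encyclopedia", "annotator_note": "check facts"}

matched, extra = split_metadata(metadata, configured_properties)
print(matched)  # keys that would be usable for filtering and sorting
print(extra)    # keys that would only be stored with the record
```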

As an example, here is how you can transform a whole dataset into records at once, renaming the fields and optionally filtering the original dataset:

records = [
    rg.FeedbackRecord(
        fields={"question": record["instruction"], "answer": record["response"]}
    )
    for record in hf_dataset
    if record["category"] == "open_qa"
]
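
If you also want to keep information from the original rows as metadata, such as the category, it can help to prepare the arguments for each record first. This is a minimal sketch in plain Python, assuming rows with the `instruction`, `response`, and `category` columns used above. The sample rows and the `row_to_record_kwargs` helper are made up for illustration.

```python
def row_to_record_kwargs(row):
    """Map one source row to the keyword arguments of a record."""
    return {
        "fields": {"question": row["instruction"], "answer": row["response"]},
        "metadata": {"category": row["category"]},
    }


# Made-up rows standing in for the Hugging Face dataset.
rows = [
    {"instruction": "What is a camel?", "response": "A mammal.", "category": "open_qa"},
    {"instruction": "Summarize this.", "response": "Short text.", "category": "summarization"},
]

# Keep only the open_qa rows, as in the example above.
record_kwargs = [row_to_record_kwargs(row) for row in rows if row["category"] == "open_qa"]
print(len(record_kwargs))
```

Each dictionary could then be unpacked into a record, for example with `rg.FeedbackRecord(**kwargs)`.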

Now, we simply add our records to the dataset we configured above:

# Add records to the dataset
dataset.add_records(records)

Add suggestions#

Suggestions are suggested responses (e.g., model predictions) that you can add to your records to make the annotation process faster. They can be added when the record is created or at a later stage. Only one suggestion can be provided for each question, and suggestion values must be compliant with the pre-defined questions. For example, if we have a RatingQuestion between 1 and 5, the suggestion should have a valid value within that range.
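
Both constraints can be checked before the suggestions are attached to a record. The sketch below is plain Python and not part of the Argilla SDK: the `check_suggestions` helper, the `rating_ranges` mapping, and the `quality` question name are assumptions made for illustration.

```python
def check_suggestions(suggestions, rating_ranges):
    """Return a list of problems found in a list of suggestion dicts."""
    problems = []
    seen = set()
    for suggestion in suggestions:
        name = suggestion["question_name"]
        # Only one suggestion is allowed per question.
        if name in seen:
            problems.append(f"more than one suggestion for '{name}'")
        seen.add(name)
        # Rating values must fall inside the configured range.
        if name in rating_ranges:
            low, high = rating_ranges[name]
            value = suggestion["value"]
            if not isinstance(value, int) or not low <= value <= high:
                problems.append(f"value {value!r} for '{name}' is outside {low}-{high}")
    return problems


# Hypothetical rating question named "quality" with values from 1 to 5.
rating_ranges = {"quality": (1, 5)}

ok = check_suggestions([{"question_name": "quality", "value": 5}], rating_ranges)
bad = check_suggestions(
    [{"question_name": "quality", "value": 7}, {"question_name": "quality", "value": 3}],
    rating_ranges,
)
print(ok)
print(bad)
```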

Label

record = rg.FeedbackRecord(
    fields=...,
    suggestions = [
        {
            "question_name": "relevant",
            "value": "YES",
        }
    ]
)

Multi-label

record = rg.FeedbackRecord(
    fields=...,
    suggestions = [
        {
            "question_name": "content_class",
            "value": ["hate", "violent"]
        }
    ]
)

Ranking

record = rg.FeedbackRecord(
    fields=...,
    suggestions = [
        {
            "question_name": "preference",
            "value":[
                {"rank": 1, "value": "reply-2"},
                {"rank": 2, "value": "reply-1"},
                {"rank": 3, "value": "reply-3"},
            ],
        }
    ]
)
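
When a ranking suggestion comes from a model, the `value` list can be derived from per-reply scores. This is a minimal plain-Python sketch that assumes a higher score means a better reply. The `ranking_from_scores` helper and the scores are made up for illustration.

```python
def ranking_from_scores(scores):
    """Turn {option: score} into a ranking value list, best score first."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return [
        {"rank": position, "value": option}
        for position, option in enumerate(ordered, start=1)
    ]


# Made-up scores; higher is assumed to mean a better reply.
scores = {"reply-1": 0.4, "reply-2": 0.9, "reply-3": 0.1}
ranking = ranking_from_scores(scores)
print(ranking)
```

The resulting list has the same shape as the `value` in the ranking suggestion above.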

Rating

record = rg.FeedbackRecord(
    fields=...,
    suggestions = [
        {
            "question_name": "quality",
            "value": 5,
        }
    ]
)

Text

record = rg.FeedbackRecord(
    fields=...,
    suggestions = [
        {
            "question_name": "corrected-text",
            "value": "This is a *suggestion*.",
        }
    ]
)

You can also add suggestions to existing records that have already been pushed to Argilla:

import argilla as rg

rg.init(api_url="<ARGILLA_API_URL>", api_key="<ARGILLA_API_KEY>")

dataset = rg.FeedbackDataset.from_argilla(name="my_dataset", workspace="my_workspace")

Argilla 1.14.0 or higher

for record in dataset.records:
    record.update(suggestions=[{"question_name": "question", "value": ...}]) # Directly pushes the update to Argilla

Lower than Argilla 1.14.0

for record in dataset.records:
    record.set_suggestions([{"question_name": "question", "value": ...}])
dataset.push_to_argilla() # No need to provide `name` and `workspace`, as they were already retrieved via the `from_argilla` classmethod

Add responses#

If your dataset includes some annotations, you can add those to the records as you create them. Make sure that the responses adhere to the same format as Argilla’s output and meet the schema requirements for the specific type of question being answered. Note that just one response with an empty user_id can be specified, as the first occurrence of user_id=None will be set to the active user_id, while the rest of the responses with user_id=None will be discarded.
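
The rule for responses without a user_id can be pictured with a small plain-Python sketch. It only mimics the behaviour described above and is not Argilla’s implementation. The `resolve_user_ids` helper and the user IDs are hypothetical.

```python
def resolve_user_ids(responses, active_user_id):
    """Keep responses with a user_id; assign the first one without it to the active user."""
    resolved = []
    used_default = False
    for response in responses:
        if response.get("user_id") is not None:
            resolved.append(response)
        elif not used_default:
            resolved.append({**response, "user_id": active_user_id})
            used_default = True
        # Any further response without a user_id is dropped.
    return resolved


responses = [
    {"values": {"relevant": {"value": "YES"}}},
    {"values": {"relevant": {"value": "NO"}}},
    {"user_id": "user-b", "values": {"relevant": {"value": "YES"}}},
]
resolved = resolve_user_ids(responses, active_user_id="user-a")
print([r["user_id"] for r in resolved])
```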

Label

record = rg.FeedbackRecord(
    fields=...,
    responses = [
        {
            "values":{
                "relevant":{
                    "value": "YES"
                }
            }
        }
    ]
)

Multi-label

record = rg.FeedbackRecord(
    fields=...,
    responses = [
        {
            "values":{
                "content_class":{
                    "value": ["hate", "violent"]
                }
            }
        }
    ]
)

Ranking

record = rg.FeedbackRecord(
    fields=...,
    responses = [
        {
            "values":{
                "preference":{
                    "value":[
                        {"rank": 1, "value": "reply-2"},
                        {"rank": 2, "value": "reply-1"},
                        {"rank": 3, "value": "reply-3"},
                    ],
                }
            }
        }
    ]
)

Rating

record = rg.FeedbackRecord(
    fields=...,
    responses = [
        {
            "values":{
                "quality":{
                    "value": 5
                }
            }
        }
    ]
)

Text

record = rg.FeedbackRecord(
    fields=...,
    responses = [
        {
            "values":{
                "corrected-text":{
                    "value": "This is a *response*."
                }
            }
        }
    ]
)

Push to Argilla#

To import the dataset into your Argilla instance, use the push_to_argilla method of your FeedbackDataset instance. Once pushed, you will be able to see your dataset in the UI.

Note

From Argilla 1.14.0, calling push_to_argilla not only pushes the FeedbackDataset to Argilla but also returns the remote FeedbackDataset instance. With the remote instance, additions, updates, and deletions of records are pushed to Argilla as soon as they are made. In previous versions of Argilla, you had to call push_to_argilla again to push such changes.

Argilla 1.14.0 or higher

remote_dataset = dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")

Lower than Argilla 1.14.0

dataset.push_to_argilla(name="my-dataset", workspace="my-workspace")

Now you’re ready to start the annotation process.
