
🩹 Delete labels from a Token or Text Classification dataset

It's not uncommon to want to delete one of the labels in your dataset, maybe because you changed your mind or because you want to correct the label's name. However, this is not a trivial change: if the dataset already has annotations, it has implications down the line and can trigger errors.

In this tutorial, you will learn how to delete, modify or merge labels to deal with this situation when using Token and Text Classification datasets.

Let's get started!

Note

This tutorial is a Jupyter Notebook. There are two options to run it:

  • Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.

  • Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.

Setup

For this tutorial, you will need to have an Argilla server running. If you don't have one already, check out our Quickstart or Installation pages. Once you do, complete the following steps:

  1. Install the Argilla client and the required third-party libraries using pip:

[ ]:
%pip install --upgrade argilla -qqq
  2. Let's make the necessary imports:

[ ]:
import argilla as rg
  3. If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API key:

[ ]:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="admin.apikey"
)

If you're running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

Enable Telemetry

We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so that we can offer you the most suitable content. This is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the Telemetry page.

[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

First steps

Let's set some variables to avoid making mistakes down the line.

[ ]:
# save the name of the dataset that we will be working with
dataset_name = "my_dataset"

# and set the workspace where the dataset is located
rg.set_workspace("my_workspace")

Optionally, you can create a backup of the dataset in case you want to revert the changes. To do that, you may want to create a workspace dedicated to saving backups and copy the dataset there.

[ ]:
# optional: create a new workspace for the backups.
backups_ws = rg.Workspace.create("backups")
[ ]:
# optional: if you want users without the owner role to have access to this workspace
# change `username` and run this cell.
user = rg.User.from_name("username")
backups_ws.add_user(user.id)
[ ]:
# copy the dataset in the new workspace
rg.copy(dataset_name, name_of_copy=f"{dataset_name}_backup", workspace=backups_ws.name)

Let's load the settings and take a look at the available labels.

Tip

Use the result to copy-paste the name(s) of the label(s) you will use to avoid mistakes.

[ ]:
settings = rg.load_dataset_settings(dataset_name)
[ ]:
# run this cell if you need to read or copy the labels
settings.label_schema

Now, save some variables with the label that you want to change (old_label) and what you want to change it to (new_label). Depending on what you intend to do, you will choose between one of these options:

  1. If you want to change the text of the label, you will save the new text in new_label.

  2. If you want to merge the annotations of one label with another existing label, you will save the label you wish to remove in old_label and the label that will contain the annotations now in new_label.

  3. If you want to remove a label and all its annotations, you will need to delete/comment out new_label or set it to None.
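
The three options above differ only in how new_label relates to the current schema. As a minimal sketch, you could double-check which operation your variables describe before touching any record. The helper describe_change is hypothetical and not part of Argilla, and the schema is assumed to be a plain list of strings:

```python
def describe_change(schema, old_label, new_label=None):
    """Return which operation the (old_label, new_label) pair describes.

    Hypothetical helper, not part of Argilla. `schema` is assumed to be
    a plain list of label names.
    """
    if old_label not in schema:
        raise ValueError(f"{old_label!r} is not in the schema: {schema}")
    if new_label is None:
        return "remove"
    if new_label == old_label:
        raise ValueError("old_label and new_label are identical")
    # merging targets a label that already exists; renaming introduces a new one
    return "merge" if new_label in schema else "rename"


# made-up schema, for illustration only
example_schema = ["positive", "negative", "neutral"]
print(describe_change(example_schema, "neutral", "mixed"))     # rename
print(describe_change(example_schema, "neutral", "negative"))  # merge
print(describe_change(example_schema, "neutral"))              # remove
```

Raising an error when old_label is missing from the schema catches typos early, which is the same reason the tip above suggests copy-pasting label names.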

[ ]:
# set the old and new labels as variables, to avoid errors down the line
old_label = "old_label"
# comment out or set to None if you want to remove the label
new_label = "new_label"

If you are using the new_label variable to add a label that isn't present in the current schema, you will need to add it now. If not, skip the following cell.

[ ]:
# add any labels that were not present in the original settings
settings.label_schema.append(new_label)
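
Running the cell above when new_label is None or is already part of the schema would add an unwanted entry. The following is a minimal sketch of a guarded version, assuming the label schema behaves like a plain Python list; the helper name add_label_if_missing is hypothetical:

```python
def add_label_if_missing(schema, label):
    """Append `label` to `schema` unless it is None or already present.

    Hypothetical helper; `schema` is assumed to be a plain list of strings.
    Returns True when the label was added.
    """
    if label is None or label in schema:
        return False
    schema.append(label)
    return True


# made-up schema, for illustration only
example_schema = ["positive", "negative"]
add_label_if_missing(example_schema, "neutral")   # added
add_label_if_missing(example_schema, "negative")  # already present, skipped
add_label_if_missing(example_schema, None)        # removal case, skipped
print(example_schema)  # ['positive', 'negative', 'neutral']
```

With the variables from this tutorial, the call would presumably look like add_label_if_missing(settings.label_schema, new_label).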

Remove the unwanted label from the records

Before you can change the settings of the dataset, you will need to remove the label that you want to delete from all annotations and predictions in the records. Otherwise, you'll get an error. To do that, first fetch all the records that have the label using a query.

[ ]:
# get all records with the old label in the annotations or predictions
records = rg.load(dataset_name, query=f"annotated_as:{old_label} OR predicted_as:{old_label}")
len(records)

Now, you can replace or remove every occurrence of the label inside the annotations and predictions.

[ ]:
def cleaning_function(labels, old_label, new_label):
    """Replace old_label with new_label, or drop it when new_label is None."""

    # replaces / removes string labels (e.g. single-label TextClassification)
    if isinstance(labels, str):
        if labels == old_label:
            labels = new_label

    elif isinstance(labels, list) and labels:
        # replaces / removes labels in a list (e.g. multi-label TextClassification)
        if isinstance(labels[0], str):
            if new_label is None:
                labels = [label for label in labels if label != old_label]
            else:
                labels = [new_label if label == old_label else label for label in labels]

        # replaces / removes labels in a list of tuples (e.g. Predictions, TokenClassification)
        elif isinstance(labels[0], tuple):
            if new_label is None:
                # build a new list instead of removing items while iterating
                labels = [label for label in labels if label[0] != old_label]
            else:
                # the label is the first element; keep the rest of the tuple (offsets, scores)
                labels = [
                    (new_label, *label[1:]) if label[0] == old_label else label
                    for label in labels
                ]

    return labels
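
One pitfall is worth keeping in mind when dropping labels from a list of tuples: removing items from a list while iterating over it can skip the element right after each removal. The sketch below uses made-up Token Classification spans of the form (label, start, end) to contrast that pattern with building a new list. The data and the function names are illustrative only:

```python
def drop_in_place(spans, label):
    """Removes while iterating: adjacent matches can be skipped."""
    spans = list(spans)  # work on a copy so the input is left untouched
    for span in spans:
        if span[0] == label:
            spans.remove(span)
    return spans


def drop_with_filter(spans, label):
    """Builds a new list, so every span is checked exactly once."""
    return [span for span in spans if span[0] != label]


# made-up spans: two adjacent "ORG" entities followed by a "PER" entity
example_spans = [("ORG", 0, 5), ("ORG", 10, 15), ("PER", 20, 25)]
print(drop_in_place(example_spans, "ORG"))     # [('ORG', 10, 15), ('PER', 20, 25)]
print(drop_with_filter(example_spans, "ORG"))  # [('PER', 20, 25)]
```

The first version leaves one "ORG" span behind because the iterator moves past it after the preceding removal shifts the list.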
[ ]:
# loop over the records and make the correction in the predictions and annotations
for record in records:
    if record.prediction:
        record.prediction = cleaning_function(record.prediction, old_label, new_label)
    if record.annotation:
        record.annotation = cleaning_function(record.annotation, old_label, new_label)
        record.status = "Default"
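
Before logging the records back, it can be worth confirming that no occurrence of the old label is left. The following is a minimal sketch with stand-in objects rather than real Argilla records: contains_label is a hypothetical helper, and SimpleNamespace only mimics the prediction and annotation attributes used above.

```python
from types import SimpleNamespace


def contains_label(value, label):
    """Return True if `label` appears in a prediction/annotation value.

    Handles a single string, a list of strings, or a list of tuples whose
    first element is the label. Hypothetical helper, not part of Argilla.
    """
    if value is None:
        return False
    if isinstance(value, str):
        return value == label
    return any(
        (item[0] if isinstance(item, tuple) else item) == label
        for item in value
    )


# stand-in records that only mimic the attributes used in this tutorial
example_records = [
    SimpleNamespace(prediction=[("new_label", 0.9)], annotation="new_label"),
    SimpleNamespace(prediction=None, annotation=["new_label", "other"]),
    SimpleNamespace(prediction=[("old_label", 0, 4)], annotation=None),
]

leftovers = [
    record
    for record in example_records
    if contains_label(record.prediction, "old_label")
    or contains_label(record.annotation, "old_label")
]
print(len(leftovers))  # 1: only the third stand-in record still has the old label
```

With real data, you would run the same check over the records list from the loop above and expect an empty result before logging.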

Hint

If you are changing the name of the label to correct a typo or you are removing the label from a Token Classification dataset or a multi-label Text Classification dataset, you may skip changing the status of the records to Default.

Warning

If you are replacing one label with another, it is highly recommended to change the status to Default so that you can double-check during annotation that the new label applies in all cases. If you are removing a label from a single-label Text Classification dataset you will always need to set the status of the record to Default.

After modifying the records, log them back into their original dataset to save the changes.

[ ]:
# log the corrected records
rg.log(records, name=dataset_name)

Update dataset settings

Now that the label is not present in the records, you can modify the dataset settings, remove the unwanted label and save the new configuration of the dataset.

[ ]:
# remove the unwanted label from the labelling schema
settings.label_schema.remove(old_label)
[ ]:
# change the configuration of the dataset
rg.configure_dataset_settings(name=dataset_name, settings=settings)

Now the unwanted label should be gone from annotations, predictions and dataset settings.

Summary

In this tutorial, you have learned how to delete or modify a label from a Token or Text Classification dataset when annotations are already present. This notebook contains code so that you can change the name of the label, merge the annotations with another existing label or remove the label altogether.