
Train a Model with ArgillaTrainer#

In this part of our end-to-end series, we will train our model with ArgillaTrainer. You can refer to the previous tutorials on creating the dataset and adding responses and suggestions. Feel free to check out the practical guides page for more in-depth information.

Fine-tuning is the process of further training a pre-trained model on a dataset specific to your task, so that the model learns to perform better on that task. In our case, we will fine-tune a model with the dataset we created in the previous tutorial. Note that ArgillaTrainer is a wrapper around many different frameworks and models. Please have a look at the ArgillaTrainer page for more information on the NLP tasks and models that are supported.


Table of Contents#

  1. Pull the Dataset

    1. From Argilla

    2. From HuggingFace Hub

  2. Train with Default ArgillaTrainer Settings

  3. Train with Custom Formatting Function

  4. Push the Model to HuggingFace Hub

  5. Conclusion

Running Argilla#

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks.

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

  • Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.

  • Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.

First, let's install our dependencies and import the necessary libraries:

[ ]:
!pip install argilla
!pip install datasets transformers
# The spaCy framework used later in this tutorial also needs spaCy
# and its small English pipeline
!pip install spacy
!python -m spacy download en_core_web_sm
[ ]:
import argilla as rg
from argilla.feedback import FeedbackDataset, TrainingTask, ArgillaTrainer
from datasets import load_dataset
from argilla._constants import DEFAULT_API_KEY

In order to run this notebook, we will need some credentials to push and load datasets from Argilla and the 🤗 Hub. Let's set them in the following cell:

[ ]:
# Argilla credentials
api_url = "http://localhost:6900"  # "https://<YOUR-HF-SPACE>.hf.space"
api_key = DEFAULT_API_KEY  # admin.apikey
# Huggingface credentials
hf_token = "hf_..."

Log in to Argilla:

[ ]:
rg.init(api_url=api_url, api_key=api_key)

Enable Telemetry#

We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so that we can improve our content. The data is entirely anonymous, and you can choose to skip this step if you prefer. For more info, please check out the Telemetry page.

[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

Pull the Dataset#

As we uploaded the dataset we created in the previous tutorial to both Argilla and HuggingFace Hub, we can pull it from either of them. Let us see how to do both.

From Argilla#

We can pull the dataset from Argilla by using the from_argilla method.

[ ]:
dataset = rg.FeedbackDataset.from_argilla("end2end_textclassification_with_suggestions_and_responses")

From HuggingFace Hub#

We can also pull the dataset from HuggingFace Hub. Similarly, we can use the from_huggingface method to pull the dataset.

[ ]:
dataset = rg.FeedbackDataset.from_huggingface("argilla/end2end_textclassification_with_suggestions_and_responses")

Note

The dataset pulled from HuggingFace Hub is an instance of FeedbackDataset whereas the dataset pulled from Argilla is an instance of RemoteFeedbackDataset. The difference between the two is that the former is a local one and the changes made on it stay locally. On the other hand, the latter is a remote one and the changes made on it are directly reflected on the dataset on the Argilla server, which can make your process faster.
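
If you pulled the dataset from Argilla and want a local copy whose changes stay off the server, recent Argilla versions expose a pull method on remote datasets; a minimal sketch, assuming that method is available in your installed version:

[ ]:
# If `dataset` was pulled from Argilla (a RemoteFeedbackDataset), fetch
# its records into a local FeedbackDataset; edits to the local copy
# are not reflected on the server.
local_dataset = dataset.pull()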

Let us briefly examine what our dataset looks like. It consists of data items with the field text, along with the suggestions and responses we added in the previous tutorial.

[ ]:
dataset[0].fields
{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again."}
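
We can also peek at what is attached to a record; a minimal sketch, assuming the suggestions and responses attributes on feedback records:

[ ]:
# Model suggestions and annotator responses added in the previous tutorial
record = dataset[0]
print(record.suggestions)
print(record.responses)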

Train with Default ArgillaTrainer Settings#

As we previously mentioned, you can employ ArgillaTrainer for most of your favorite NLP tasks and libraries. What it does is simplify the training process by providing a unified interface for training and fine-tuning models. Let us first see how we can train our model with the default settings.

TrainingTask is the class that is responsible for how to process the data for training. To make things even easier, it offers task-specific methods that you can use for your particular task at hand. For this tutorial, we will be using the for_text_classification method, which will prepare a default data processing function for us. You can see a list of all the available task-specific methods in the documentation.

[ ]:
task = TrainingTask.for_text_classification(
    text=dataset.field_by_name("text"),
    label=dataset.question_by_name("label"),
)

Next, we need to define an ArgillaTrainer for our task that will be used for training. You can define the framework you want to work with in the framework parameter. For this tutorial, we will be using the en_core_web_sm model from the spacy framework.

[ ]:
trainer = ArgillaTrainer(
    dataset=dataset,
    task=task,
    framework="spacy",
    train_size=0.8,
    model="en_core_web_sm",
)

Let us customize the training process a bit more with the update_config method. Here we limit training to a single step so the run finishes quickly; for a real model you would train much longer. You can have a look at the comprehensive list of training configs for more customization options.

[ ]:
trainer.update_config(
    max_steps=1
)

We are already good to go with the default settings. We only need to call the train method to train our model.

[ ]:
trainer.train(output_dir="textcat_model_spacy")
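
Once training finishes, we can quickly sanity-check the model; a minimal sketch using the trainer's predict method (the example text below is illustrative):

[ ]:
# Run inference with the freshly trained model;
# as_argilla_records=False returns raw predictions instead of Argilla records.
trainer.predict(
    "The stock market rallied after the earnings report.",
    as_argilla_records=False,
)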

Train with Custom Formatting Function#

Sometimes the default settings might not be the best for your task. In those cases, you might need to customize the data processing function. You can do that by passing a custom formatting function to the for_text_classification method. Let us first define a custom formatting function and then train our model with it.

It is important to note that the custom formatting function should return an object that the model accepts. In our case, we will return a (text, label) tuple for each sample.

[ ]:
def formatting_func(sample):
    # Each sample is a dict with the record fields and the question answers
    text = sample["text"]
    # Take the value of the first annotation for the "label" question
    label = sample["label"][0]["value"]
    return text, label
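
Before handing the function to the trainer, we can sanity-check it on a hand-built sample; the dict below is illustrative, mirroring the structure the trainer passes in:

[ ]:
# A hypothetical sample in the shape the formatting function receives
example_sample = {
    "text": "Wall St. Bears Claw Back Into the Black.",
    "label": [{"value": "Business"}],
}
formatting_func(example_sample)  # -> ('Wall St. Bears Claw Back Into the Black.', 'Business')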

Now, we can feed the TrainingTask with our custom formatting function.

[ ]:
task = TrainingTask.for_text_classification(formatting_func=formatting_func)

For this example, we can use the spacy framework.

[ ]:
trainer = ArgillaTrainer(
    dataset=dataset,
    task=task,
    framework="spacy",
    train_size=0.8,
)

Again, let us customize the training process with the update_config method, limiting training to a single step for the sake of this tutorial. You can have a look at the comprehensive list of training configs for more customization options.

[ ]:
trainer.update_config(
    max_steps=1
)

Again, we just need to call the train method to train our model.

[ ]:
trainer.train(output_dir="textcat_model_spacy")
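
After training, the pipeline is saved to the output directory. A minimal sketch of loading and using it, assuming the spaCy trainer writes a loadable pipeline at (or under) that path:

[ ]:
import spacy

# Load the trained pipeline from the output directory and classify a text;
# doc.cats holds the label scores from the textcat component.
nlp = spacy.load("textcat_model_spacy")
doc = nlp("The stock market rallied after the earnings report.")
doc.cats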

Push the Model to HuggingFace Hub#

Now that we have completed training and have a model ready for inference, we can push it to HuggingFace Hub. Argilla again offers quite a simple way to accomplish this with just a single line of code.

First, to be able to upload your model to the Hub, you must be logged in. The following cell will log us in with the token we set earlier.

If we don't have one already, we can obtain it from here (remember to set write access).

[ ]:
from huggingface_hub import login

login(token=hf_token)

We only need to call the push_to_huggingface method to push the model to HuggingFace Hub. The generate_card argument also creates a model card, which makes the model more understandable for users.

[ ]:
trainer.push_to_huggingface("textcat_model_spacy", generate_card=True)
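
To use the pushed model elsewhere, one option is to download the repository and load it locally; a minimal sketch, where the repo id is illustrative and we assume the pushed artifacts form a loadable spaCy pipeline:

[ ]:
import spacy
from huggingface_hub import snapshot_download

# Download the model repository from the Hub to a local directory,
# then load it as a spaCy pipeline
local_dir = snapshot_download(repo_id="<YOUR-HF-USERNAME>/textcat_model_spacy")
nlp = spacy.load(local_dir)
nlp("Short-sellers are seeing green again.").cats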

Conclusion#

In this tutorial, we have seen how to train a model with ArgillaTrainer. We have also customized the training process with custom formatting functions and training configs. After training, we have pushed our model to HuggingFace Hub with the push_to_huggingface method. You can refer to the metrics tutorial to see how you can evaluate your model.