๐Ÿ’ช๐Ÿฝ Training#

Data Preparation#

Once you have uploaded and annotated your dataset in Argilla, you are ready to prepare it for training a model. Most NLP models today are trained via supervised learning and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation.

Manual extraction#

The exact data format for training a model depends on your training framework and the task you are tackling (text classification, token classification, etc.). Argilla is framework-agnostic; you can always manually extract what you need for training from the records.

You extract the data with the client library from within a Python script, a Jupyter notebook, or your IDE of choice. First, we load the annotated dataset from Argilla.

[ ]:
import argilla as rg

dataset = rg.load("my_annotated_dataset")

Note

If you follow a weak supervision approach, the steps are slightly different. We refer you to our weak supervision guide for a complete workflow.

Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with an sklearn pipeline that takes a text as input and outputs a label.

[ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# An example pipeline: TF-IDF features fed into a logistic regression classifier
sklearn_pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Save the inputs and labels in Python lists
inputs, labels = [], []

# Iterate over the records in the dataset
for record in dataset:
    # We only want records with annotations
    if record.annotation:
        inputs.append(record.text)
        labels.append(record.annotation)

# Train the model
sklearn_pipeline.fit(inputs, labels)
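
Once fitted, you can use the pipeline to predict labels for new, unseen texts:

[ ]:
# Classify a new text with the trained pipeline
sklearn_pipeline.predict(["A new text we would like to classify"])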

Automatic extraction#

For a few frameworks and tasks, Argilla provides a convenient method to automatically extract training examples in a suitable format from a dataset.

For example, if you want to train a transformers model for text classification, you can load an annotated dataset for text classification and call the prepare_for_training() method:

[ ]:
dataset = rg.load("my_annotated_dataset")

dataset_for_training = dataset.prepare_for_training()

With the returned dataset_for_training, you can continue following the steps to fine-tune a pre-trained model with the transformers library.
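
As a rough sketch of those steps, assuming the prepared dataset is a datasets.Dataset with "text" and "label" columns (as produced for text classification) and using distilbert-base-uncased as a placeholder checkpoint, fine-tuning could look like this:

[ ]:
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder checkpoint; pick any model suited to your task and language
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the text column of the prepared dataset
tokenized = dataset_for_training.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

# Assumes the "label" column is a ClassLabel feature, so we can read off the class count
num_labels = dataset_for_training.features["label"].num_classes
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="my_model"),  # placeholder output directory
    train_dataset=tokenized,
)
trainer.train()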

Check the dedicated dataset guide for more examples of the prepare_for_training() method.

How to train your model#

Argilla helps you create and curate training data; it is not a framework for training models. Instead, you can use Argilla together with other excellent open-source frameworks that focus on developing and training NLP models.

Here we list three of the most commonly used open-source libraries, but many more are available and may be better suited to your specific use case:

  • transformers: This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;

  • spaCy: This library also comes with pre-trained models built into pipelines that tackle multiple tasks simultaneously. Since it is a pure NLP library, it offers many more NLP features beyond model training (see the training sketch after this list);

  • scikit-learn: This de facto standard library is a powerful Swiss Army knife for machine learning, with some NLP support. Its models usually cannot match the performance of transformers or spaCy, but give it a try if you want to train a lightweight model quickly.
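
To continue the example from the manual extraction section with spaCy instead of sklearn, you could train a spaCy text classifier on the extracted examples. This is a minimal sketch, assuming the inputs and labels lists from above and an English dataset:

[ ]:
import random

import spacy
from spacy.training import Example

# Start from a blank English pipeline with a text classification component
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
for label in set(labels):
    textcat.add_label(label)

# Build training examples from the extracted (text, label) pairs
examples = [
    Example.from_dict(
        nlp.make_doc(text),
        {"cats": {l: float(l == label) for l in set(labels)}},
    )
    for text, label in zip(inputs, labels)
]

# Initialize the pipeline and run a simple training loop
nlp.initialize(lambda: examples)
for epoch in range(10):
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, losses=losses)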