💪🏽 Training#
Data Preparation#
Once you have uploaded and annotated your dataset in Argilla, you are ready to prepare it for training a model. Most NLP models today are trained via supervised learning and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation.
Manual extraction#
The exact data format for training a model depends on your training framework and the task you are tackling (text classification, token classification, etc.). Argilla is framework agnostic; you can always manually extract from the records whatever you need for training.
The extraction happens via the client library, within a Python script, a Jupyter notebook, or any other environment. First, we have to load the annotated dataset from Argilla.
[ ]:
import argilla as rg
dataset = rg.load("my_annotated_dataset")
Note
If you follow a weak supervision approach, the steps are slightly different. We refer you to our weak supervision guide for a complete workflow.
Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with a sklearn pipeline that takes a text as input and outputs a label.
[ ]:
# Save the inputs and labels in Python lists
inputs, labels = [], []

# Iterate over the records in the dataset
for record in dataset:
    # We only want records with annotations
    if record.annotation:
        inputs.append(record.text)
        labels.append(record.annotation)

# Train the model
sklearn_pipeline.fit(inputs, labels)
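The `sklearn_pipeline` above is left undefined in this guide. As a minimal, self-contained sketch, such a pipeline could combine a TF-IDF vectorizer with a logistic regression classifier; the toy `inputs` and `labels` below are hypothetical stand-ins for the lists extracted from your Argilla records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for the inputs/labels extracted from the Argilla records
inputs = ["great product", "loved it", "terrible service", "awful quality"]
labels = ["positive", "positive", "negative", "negative"]

# A simple text classification pipeline: TF-IDF features + logistic regression
sklearn_pipeline = Pipeline(
    [
        ("tfidf", TfidfVectorizer()),
        ("clf", LogisticRegression()),
    ]
)

# Fit on the extracted examples; afterwards the pipeline can predict
# labels for new texts via sklearn_pipeline.predict([...])
sklearn_pipeline.fit(inputs, labels)
```

Any estimator with a `fit(X, y)` method would work in place of the logistic regression step.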
Automatic extraction#
For a few frameworks and tasks, Argilla provides a convenient method to automatically extract training examples in a suitable format from a dataset.
For example, if you want to train a transformers model for text classification, you can load an annotated dataset for text classification and call the prepare_for_training() method:
[ ]:
dataset = rg.load("my_annotated_dataset")
dataset_for_training = dataset.prepare_for_training()
With the returned dataset_for_training, you can continue following the steps to fine-tune a pre-trained model with the transformers library.
Check the dedicated dataset guide for more examples of the prepare_for_training() method.
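For text classification, the prepared dataset holds text/label pairs (the exact return type depends on the target framework). Before fine-tuning, you will usually want to hold out an evaluation split. A minimal sketch, using hypothetical examples that mimic the shape of the prepared data rather than a real Argilla dataset:

```python
import random

# Hypothetical examples mimicking the text/label pairs of a prepared
# dataset; in practice they would come from prepare_for_training()
examples = [{"text": f"example text {i}", "label": i % 2} for i in range(10)]

# Shuffle and hold out 20% of the examples for evaluation
random.seed(42)
random.shuffle(examples)
split = int(0.8 * len(examples))
train_examples = examples[:split]
eval_examples = examples[split:]
```

The train split is then passed to your trainer, while the eval split is reserved for measuring generalization.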
How to train your model#
Argilla helps you create and curate training data; it is not a framework for training models. You can use Argilla alongside other excellent open-source frameworks that focus on developing and training NLP models.
Here we list three of the most commonly used open-source libraries, but many more are available and may be more suited for your specific use case:
transformers: This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case.
spaCy: This library also ships pre-trained models, built into pipelines that tackle multiple tasks simultaneously. Since it is a dedicated NLP library, it offers many more NLP features beyond model training.
scikit-learn: This de facto standard library is a powerful Swiss army knife for machine learning, with some NLP support. Its NLP models usually lag behind transformers or spaCy in performance, but give it a try if you want to train a lightweight model quickly.