Once you have uploaded and annotated your dataset in Argilla, you are ready to prepare it for training a model. Most NLP models today are trained via supervised learning and need input-output pairs to serve as training examples for the model. The input part of such pairs is usually the text itself, while the output is the corresponding annotation.
The exact data format for training a model depends on your training framework and the task you are tackling (text classification, token classification, etc.). Argilla is framework agnostic; you can always manually extract from the records whatever you need for training.
The extraction happens via the client library, in a Python script, a Jupyter notebook, or your IDE of choice. First, we have to load the annotated dataset from Argilla:
```python
import argilla as rg

dataset = rg.load("my_annotated_dataset")
```
If you follow a weak supervision approach, the steps are slightly different. We refer you to our weak supervision guide for a complete workflow.
Then we can iterate over the records and extract our training examples. For example, let's assume you want to train a text classifier with a scikit-learn pipeline that takes a text as input and outputs a label.
```python
# Save the inputs and labels in Python lists
inputs, labels = [], []

# Iterate over the records in the dataset
for record in dataset:
    # We only want records with annotations
    if record.annotation:
        inputs.append(record.text)
        labels.append(record.annotation)

# Train the model
sklearn_pipeline.fit(inputs, labels)
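The `sklearn_pipeline` above is left undefined. A minimal stand-in could look as follows; the choice of TF-IDF features plus logistic regression is an illustrative assumption, not something the text prescribes, and the inputs and labels here are toy data standing in for the extracted records:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# A minimal text classification pipeline: TF-IDF features + logistic regression
sklearn_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# Toy examples standing in for the inputs/labels extracted from Argilla records
inputs = [
    "great product, works perfectly",
    "terrible service, very disappointed",
    "love it, highly recommended",
    "awful experience, do not buy",
]
labels = ["positive", "negative", "positive", "negative"]

sklearn_pipeline.fit(inputs, labels)
print(sklearn_pipeline.predict(["really great, recommended"]))
```

Any scikit-learn estimator that accepts a list of texts and a list of labels would fit the same pattern.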
For a few frameworks and tasks, Argilla provides a convenient method to automatically extract training examples in a suitable format from a dataset.
For example, if you want to train a transformers model for text classification, you can load an annotated dataset for text classification and call the `prepare_for_training()` method:
```python
dataset = rg.load("my_annotated_dataset")
dataset_for_training = dataset.prepare_for_training()
```
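Conceptually, this method filters for annotated records and maps them to framework-ready input-output pairs. A rough, framework-free sketch of that idea, where the `Record` dataclass and the dict output format are simplified stand-ins rather than Argilla's actual API:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Record:
    """Simplified stand-in for an Argilla record."""
    text: str
    annotation: Optional[str] = None


def prepare_for_training(records: List[Record]) -> List[dict]:
    # Keep only annotated records and map them to input/output pairs
    return [
        {"text": r.text, "label": r.annotation}
        for r in records
        if r.annotation is not None
    ]


records = [
    Record("I love this movie", "positive"),
    Record("not annotated yet"),  # skipped: no annotation
    Record("worst film ever", "negative"),
]
print(prepare_for_training(records))
```

The actual return type of Argilla's method depends on the target framework and task; the point here is only the filter-and-map shape of the transformation.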
Check the dedicated dataset guide for more examples of the `prepare_for_training()` method.
How to train your model
Argilla helps you create and curate training data. It is not a framework for training models. Instead, you can use it alongside other excellent open-source frameworks that focus on developing and training NLP models.
Here we list three of the most commonly used open-source libraries, but many more are available and may be more suited for your specific use case:
- transformers: This library provides thousands of pre-trained models for various NLP tasks and modalities. Its excellent documentation focuses on fine-tuning those models to your specific use case;
- spaCy: This library also comes with pre-trained models, built into pipelines that tackle multiple tasks simultaneously. Since it is a dedicated NLP library, it offers many more NLP features beyond model training;
- scikit-learn: This de facto standard library is a powerful Swiss Army knife for machine learning with some NLP support. Its NLP models usually lag behind transformers or spaCy in performance, but give them a try if you want to train a lightweight model quickly.