🤯 Few-shot classification with SetFit
SetFit is an exciting open-source package for few-shot classification developed by teams at Hugging Face and Intel Labs. You can read all about it on the project repository.
To showcase how powerful the combination of SetFit and Argilla is:
We manually label 55 examples from the unlabelled split of the imdb dataset,
we train a model in 5 min,
and without using a single example from the original imdb training set, we achieve 0.9 accuracy on the full test set!
In this tutorial, you’ll learn to:
Load an unlabelled dataset into Argilla. We’ll be using the unlabelled split from the imdb movie reviews sentiment dataset. This same workflow can be applied to any custom dataset, problem, and language!
Manually label a FEW examples using the UI.
Train a SetFit model to get highly competitive results. For this example, with only 55 examples, we get 0.9 accuracy on the test set, which is comparable to models fine-tuned on 3K examples. That means similar performance with roughly 50x fewer examples 🤯.
For reference see the Hugging Face Hub and PapersWithCode leaderboards.
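The "50x" figure is a quick back-of-the-envelope ratio. The snippet below simply redoes that arithmetic, reading "3K" as 3,000 (an approximation, since the text does not give an exact count):

```python
# Numbers taken from the text: 55 hand-labelled examples vs. roughly 3K
labelled_examples = 55
reference_examples = 3_000  # "3K" read as 3,000; an approximation

# How many times fewer labelled examples the few-shot setup needs
reduction_factor = reference_examples / labelled_examples
print(f"About {reduction_factor:.1f}x fewer labelled examples")
```

The ratio lands a little above 50, which is where the rounded figure comes from.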
Let’s get started!
For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:
For details about configuring your deployment, check the official Hugging Face Hub guide.
Launch Argilla using Argilla’s quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.
For more information on deployment options, please check the Deployment section of the documentation.
This tutorial is a Jupyter Notebook. There are two options to run it:
Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don’t forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
%pip install argilla "setfit~=0.2.0" "datasets~=2.3.0" -qqq
Let’s import the Argilla module for reading and writing data:
import argilla as rg
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the api_url and api_key:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="team.apikey"
)
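If you would rather not hard-code the URL and key in the notebook, one option is to read them from environment variables and fall back to the quickstart defaults shown above. This is a minimal sketch: the helper `argilla_init_kwargs` is hypothetical, and the variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are assumed to match the ones the Argilla client itself looks up.

```python
import os


def argilla_init_kwargs(env=os.environ):
    """Return keyword arguments for rg.init, preferring environment variables.

    Hypothetical helper: falls back to the quickstart defaults from this tutorial.
    """
    return {
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": env.get("ARGILLA_API_KEY", "team.apikey"),
    }


# With nothing set in the environment mapping, the quickstart defaults are used
print(argilla_init_kwargs({}))

# Usage (requires a running Argilla server):
# import argilla as rg
# rg.init(**argilla_init_kwargs())
```

Passing the environment as an argument keeps the helper easy to check without touching the real process environment.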
Let’s import the modules we need:
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer
Load unlabelled dataset in Argilla
First, we load the unsupervised split from the imdb dataset and create a new Argilla dataset with 100 random examples:
unlabelled = (
    load_dataset("imdb", split="unsupervised").shuffle(seed=42).select(range(100))
)
unlabelled = rg.DatasetForTextClassification.from_datasets(unlabelled)
rg.log(unlabelled, "imdb_unlabelled")
In this step, we create the labels pos and neg, using the same label scheme as the original dataset. Then we use the UI to label a few examples one after another. For this example, labelling took us just 15 minutes.
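With so few examples, it can be worth checking that both classes are reasonably represented before training. The sketch below counts labels in a plain list of annotations. The helper `label_counts` and the sample values are made up for illustration, and the labels are assumed to be pos and neg, following the imdb sentiment scheme.

```python
from collections import Counter


def label_counts(annotations):
    """Count labelled examples per class, ignoring records without a label."""
    return Counter(a for a in annotations if a is not None)


# Hypothetical annotations; None stands for a record that is still unlabelled
example_annotations = ["pos", "neg", None, "pos", "neg", "neg"]
counts = label_counts(example_annotations)
print(counts)

# Possible usage with the dataset from this tutorial, assuming each record
# exposes its label as `annotation`:
# counts = label_counts(r.annotation for r in rg.load("imdb_unlabelled"))
```

If one class turns out to be badly under-represented, labelling a few more examples of it is usually cheap at this scale.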
Before training, you can easily share the dataset using the push_to_hub method. This might be useful if you don’t have a GPU on your machine and want to use a training service or Colab for example.
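As a rough sketch of that sharing step: the function below loads the labelled records, converts them into a Hugging Face dataset, and pushes it to the Hub. The function name and the repository id are placeholders, and it assumes that `prepare_for_training()` returns a `datasets.Dataset` and that you are already logged in to the Hub.

```python
def share_labelled_dataset(repo_id, dataset_name="imdb_unlabelled"):
    """Push the hand-labelled records to the Hugging Face Hub (hypothetical helper)."""
    # Imported inside the function so the sketch only needs Argilla when called
    import argilla as rg

    # Assumption: this returns a datasets.Dataset with the labelled records
    train_ds = rg.load(dataset_name).prepare_for_training()
    train_ds.push_to_hub(repo_id)
    return train_ds


# Usage (needs a running Argilla server and Hub credentials):
# share_labelled_dataset("<your-username>/<your-dataset-name>")
```

On the training machine, the same dataset could then be fetched with `load_dataset` using that repository id.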
Train and evaluate SetFit model
Finally, we are ready to test SetFit!
Thanks to Argilla’s integration with datasets and the Hub, if you don’t have a local GPU you can use this Google Colab to reproduce the training process with the labelled dataset. If you use a GPU runtime, it literally takes 5 minutes to train.
Below we load the dataset from Argilla, format it for training with transformers, load the full imdb test dataset, load a pre-trained sentence transformers model, train the SetFit model, and evaluate it!
# Load the handlabelled dataset from Argilla
train_ds = rg.load("imdb_unlabelled").prepare_for_training()

# Load the full imdb test dataset
test_ds = load_dataset("imdb", split="test")

# Load SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # The number of text pairs to generate
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
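To give an intuition for what num_iterations controls, here is a simplified sketch of contrastive pair generation. It is not SetFit's actual implementation: the function `generate_pairs` and the toy data are made up for illustration. The idea is that, in each iteration, every labelled text is paired once with a text of the same class (target 1.0) and once with a text of a different class (target 0.0).

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=0):
    """Build (text_a, text_b, target) pairs for contrastive fine-tuning.

    Simplified illustration: target is 1.0 when both texts share a label
    and 0.0 otherwise.
    """
    rng = random.Random(seed)
    indexed = list(zip(texts, labels))
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(indexed):
            # Candidates with the same label (excluding the text itself)
            same = [t for j, (t, l) in enumerate(indexed) if l == label and j != i]
            # Candidates with a different label
            diff = [t for t, l in indexed if l != label]
            if same:
                pairs.append((text, rng.choice(same), 1.0))
            if diff:
                pairs.append((text, rng.choice(diff), 0.0))
    return pairs


# Toy data standing in for the hand-labelled reviews (made up for illustration)
toy_texts = ["great film", "loved it", "boring plot", "awful acting"]
toy_labels = ["pos", "pos", "neg", "neg"]

toy_pairs = generate_pairs(toy_texts, toy_labels, num_iterations=3)
print(len(toy_pairs))  # 4 examples x 2 pairs x 3 iterations = 24
```

Under this scheme, 55 labelled examples with num_iterations=20 would produce up to 2 x 55 x 20 = 2,200 training pairs, which is how a handful of labels can still yield a useful amount of training signal.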
Optionally, you can share your amazing model with the world!
The metrics object should give you around 0.9 accuracy on the full test set 🎉
We have manually labelled 55 examples,
We haven’t used a single example from the original training set,
and we’ve trained the model in 5 min!
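For reference, the accuracy figure mentioned above is just the fraction of test examples whose predicted label matches the true label. Here is a minimal sketch of that computation, with made-up predictions. It assumes the trainer reports plain accuracy, and the helper `accuracy` is written only for illustration.

```python
def accuracy(predictions, references):
    """Return the fraction of predictions that match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up labels: 1 = positive review, 0 = negative review
toy_predictions = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
toy_references = [1, 1, 0, 0, 1, 0, 1, 1, 0, 1]

# 9 of the 10 predictions match the references
toy_accuracy = accuracy(toy_predictions, toy_references)
print(toy_accuracy)
```

An accuracy of 0.9 on the full test set therefore means that about nine out of every ten test reviews were classified correctly.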
Now, I don’t think you have any more excuses to not invest some time labeling a few good quality examples!
If you are interested in SetFit, you can check our other SetFit + Argilla tutorials:
Or check out the SetFit repository on GitHub.
If you want to continue learning Argilla:
🙋♀️ Join the Argilla Slack community!
⭐ Argilla GitHub repo to stay updated.
📚 Argilla documentation for more guides and tutorials.