Open In Colab  View Notebook on GitHub

🤯 Few-shot classification with SetFit#

SetFit is an exciting open-source package for few-shot classification developed by teams at Hugging Face and Intel Labs. You can read all about it on the project repository.

To showcase how powerful the combination of SetFit and Argilla is:

  • We manually label 55 examples from the unlabelled split of the IMDb dataset,

  • we train a model in 5 min,

  • and without using a single example from the original IMDb training set, we achieve 0.9 accuracy on the full test set!

Summary#

In this tutorial, you'll learn to:

  1. Load an unlabelled dataset in Argilla. We'll be using the unlabelled split from the imdb movie reviews sentiment dataset. This same workflow can be applied to any custom dataset, problem, and language!

  2. Manually label a FEW examples using the UI.

  3. Train a SetFit model to get highly competitive results. For this example, with only 55 examples, we get 0.9 accuracy on the test set, which is comparable to models fine-tuned on 3K examples. That means similar performance with 50x fewer examples 🤯.

For reference see the Hugging Face Hub and PapersWithCode leaderboards.
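As a quick sanity check on the "50x" figure, the ratio can be worked out from the two numbers quoted above. The variable names are only illustrative.

```python
# Numbers quoted in the summary above.
few_shot_examples = 55
fine_tuning_examples = 3_000  # "3K examples"

reduction_factor = fine_tuning_examples / few_shot_examples
print(f"{reduction_factor:.1f}x fewer labeled examples")  # 54.5x fewer labeled examples
```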

Let's get started!

Running Argilla#

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

deploy on spaces

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

  • Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.

  • Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or in the Jupyter Notebook tool of your choice.

[ ]:
%pip install argilla "setfit~=0.2.0" "datasets~=2.3.0" -qqq

Let's import the Argilla module for reading and writing data:

[ ]:
import argilla as rg

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the URL and API_KEY:

[ ]:
# Replace api_url with your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="http://localhost:6900",
    api_key="owner.apikey",
    workspace="admin"
)
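If you would rather not hard-code connection details in the notebook, one option is to read them from environment variables and fall back to the quickstart defaults shown above. The sketch below is only an illustration: the helper `argilla_connection_settings` is hypothetical, and the environment variable names are assumptions made for this example.

```python
import os


def argilla_connection_settings(env=None):
    """Collect keyword arguments for rg.init from environment variables.

    The variable names are assumptions made for this sketch; the fallback
    values are the quickstart defaults used in this tutorial.
    """
    env = os.environ if env is None else env
    return {
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": env.get("ARGILLA_API_KEY", "owner.apikey"),
        "workspace": env.get("ARGILLA_WORKSPACE", "admin"),
    }


settings = argilla_connection_settings()
# rg.init(**settings)  # uncomment once argilla has been imported as rg
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.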

If you're running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with your HF Spaces URL
# # Replace api_key if you configured a custom API key
# # Replace workspace with the name of your workspace
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="owner.apikey",
#     workspace="admin",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

Let's import the modules we need:

[ ]:
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer

Enable Telemetry#

We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so that we can offer you the most suitable content. The data is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the Telemetry page.

[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

Load unlabelled dataset in Argilla#

First, we load the unsupervised split from the imdb dataset and create a new Argilla dataset with 100 random examples:

[ ]:
unlabelled = (
    load_dataset("imdb", split="unsupervised").shuffle(seed=42).select(range(100))
)

unlabelled = rg.DatasetForTextClassification.from_datasets(unlabelled)

rg.log(unlabelled, "imdb_unlabelled")
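The shuffle-then-select pattern above draws a random but reproducible sample: with the same seed, the same 100 examples are picked on every run. The standard-library sketch below illustrates the idea only. It does not reproduce the shuffling algorithm used by `datasets`, so the selected indices will differ from the ones above, and `sample_indices` is a hypothetical helper.

```python
import random


def sample_indices(dataset_size, sample_size, seed):
    """Shuffle all indices with a fixed seed and keep the first sample_size."""
    rng = random.Random(seed)
    indices = list(range(dataset_size))
    rng.shuffle(indices)
    return indices[:sample_size]


# 50_000 is the size of the IMDb unsupervised split.
first_run = sample_indices(dataset_size=50_000, sample_size=100, seed=42)
second_run = sample_indices(dataset_size=50_000, sample_size=100, seed=42)
print(first_run == second_run)  # same seed, same selection
```

Changing the seed gives a different sample, which is handy if you want a fresh batch of examples to label.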

Manual labeling#

In this step, we create the labels pos and neg using the same label scheme as the original dataset. Then we use the UI to sequentially label a few examples. For this example, we spent just 15 minutes labeling.

Before training, you can easily share the dataset using the push_to_hub method. This might be useful if you don't have a GPU on your machine and want to use a training service or Colab, for example. More info here.

[ ]:
rg.load("imdb_unlabelled").prepare_for_training().push_to_hub("mini-imdb")
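Before sharing or training on such a small set, it can be worth checking how many examples each label received, since with only a few dozen records a skew is easy to miss. The sketch below uses a handful of made-up annotations in place of the records labeled in the UI; the texts and the `annotated` list are purely hypothetical.

```python
from collections import Counter

# Hypothetical annotations standing in for the records labeled in the UI.
annotated = [
    ("A moving, beautifully shot film.", "pos"),
    ("Two hours I will never get back.", "neg"),
    ("Great cast, sharp script.", "pos"),
    ("The plot makes no sense.", "neg"),
    ("I laughed the whole way through.", "pos"),
]

# Count examples per label and print each label's share of the total.
label_counts = Counter(label for _, label in annotated)
for label, count in sorted(label_counts.items()):
    print(f"{label}: {count} ({count / len(annotated):.0%})")
```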

Train and evaluate SetFit model#

Finally, we are ready to test SetFit!

Thanks to Argilla's integration with datasets and the Hub, if you don't have a local GPU you can use this Google Colab to reproduce the training process with the labelled dataset. If you use a GPU runtime, training takes about 5 minutes.

Below we load the dataset from Argilla, format it for training with transformers, load the full IMDb test dataset, load a pre-trained sentence transformers model, train the SetFit model, and evaluate it!

[ ]:
# Load the hand-labeled dataset from Argilla
train_ds = rg.load("imdb_unlabelled").prepare_for_training()

# Load the full IMDb test dataset
test_ds = load_dataset("imdb", split="test")


# Load SetFit model from Hub
model = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=16,
    num_iterations=20,  # The number of text pairs to generate
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
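To give an intuition for what num_iterations controls, here is a rough standard-library sketch of contrastive pair generation: on each iteration, every labeled text is paired once with a text of the same label (similarity 1.0) and once with a text of a different label (similarity 0.0). This is a simplified illustration under those assumptions, not SetFit's actual implementation; `generate_pairs` and the toy data are hypothetical.

```python
import random


def generate_pairs(texts, labels, num_iterations, seed=0):
    """Build (text_a, text_b, similarity) triples for contrastive training.

    Assumes texts are unique and every label has at least two examples.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for text, label in zip(texts, labels):
            same = [t for t, l in zip(texts, labels) if l == label and t != text]
            other = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(same), 1.0))   # positive pair
            pairs.append((text, rng.choice(other), 0.0))  # negative pair
    return pairs


texts = ["loved it", "a masterpiece", "terrible", "a waste of time"]
labels = ["pos", "pos", "neg", "neg"]
pairs = generate_pairs(texts, labels, num_iterations=3)
print(len(pairs))  # 4 texts * 2 pairs * 3 iterations = 24
```

Under this scheme, 55 labeled examples with num_iterations=20 would yield 55 * 2 * 20 = 2,200 training pairs, which is how a small labeled set can still provide plenty of signal for fine-tuning the sentence embeddings.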

Optionally, you can share your amazing model with the world!

[ ]:
trainer.push_to_hub("setfit-mini-imdb")

Conclusion#

The metrics object should give you around 0.9 accuracy on the full test set 🎉
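For reference, accuracy is simply the fraction of test examples whose predicted label matches the true label. The sketch below computes it by hand on a few made-up labels; the `accuracy` helper and the toy data are hypothetical and only mirror what an accuracy metric reports.

```python
def accuracy(predictions, references):
    """Fraction of predictions that match the reference labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Hypothetical labels: 1 = pos, 0 = neg.
references = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 0, 0, 0, 1, 0, 1, 0]
print(accuracy(predictions, references))  # 9 of 10 correct -> 0.9
```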

And remember:

  • We have manually labeled 55 examples,

  • We haven't used a single example from the original training set,

  • and we've trained the model in 5 min!

Now, I don't think you have any more excuses not to invest some time in labeling a few good-quality examples!

If you are interested in SetFit, you can check our other SetFit + Argilla tutorials:

Or check out the SetFit repository on GitHub.