🚀 Run Argilla with a Transformer in an active learning loop and a free GPU in your browser#

In this tutorial, you will learn how to set up a complete active learning loop in Google Colab with a GPU in the backend. It is based on the small-text active learning tutorial; the main difference is that this version is designed to run in a Google Colab notebook with a GPU backend, which makes the active learning loop with Transformer models more efficient. It is recommended to follow this tutorial directly on Google Colab: open the Colab notebook via this hyperlink, create your own copy, and modify it for your own use cases.

โš ๏ธ Note that this notebook requires manual input to start Argilla in a terminal and to input an ngrok token. Please read the instructions for each cell. If you do not follow the instructions and execute everything in the correct order, the code will bug. If you face an error, restarting your runtime can solve several issues. โš ๏ธ

๐Ÿ™‹๐Ÿผโ€โ™‚๏ธ The notebook was contributed by Moritz Laurer

Initial setup on Google Colab#

In the Colab interface, you can choose a CPU (for initial testing) or a GPU (for an efficient active learning loop) by clicking Runtime > Change runtime type > Hardware accelerator in the menu in the top left. Once you have chosen your hardware, install the required packages.

[ ]:
%pip install "argilla[server, listeners]==1.1.1"
%pip install "transformers[sentencepiece]~=4.25.1"
%pip install "datasets~=2.7.1"
%pip install "small-text[transformers]~=1.1.1"
%pip install "colab-xterm~=0.1.2"
%pip install "pyngrok~=5.2.1"
%pip install "colab-xterm~=0.1.2"
[ ]:
# info on the GPU you are using (this prints an error if you selected a CPU runtime)
!nvidia-smi
# info on available RAM
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('\n\nYour runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

Start the Argilla server in a terminal#

You now need to start the Argilla server in a separate terminal. We cannot simply run !python -m argilla in a code cell on Colab, because the cell would run indefinitely and block us from running other cells. We therefore need to open a separate terminal to run Argilla.

  1. Option with Colab Pro: Open the Colab Pro terminal (button at the bottom left) and type python -m argilla in the terminal.

  2. Option without Colab Pro: Run the following code cell to get a free terminal window embedded in the code cell with xterm, then type python -m argilla in the terminal window.

[ ]:
# create a terminal to run Argilla in, in case you don't have Colab Pro.
# type "python -m argilla" into the terminal that appears below this code cell.
%load_ext colabxterm
%xterm

The terminal window above should now display something like:

"… INFO: Application startup complete.

INFO: Uvicorn running on http://0.0.0.0:6900 (Press CTRL+C to quit)"
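
Because the Argilla server only listens on the Colab machine's localhost, you need a public link to open the UI in your own browser. One way to create it is an ngrok tunnel via the pyngrok package installed above. The cell below is a minimal sketch: the auth token placeholder is an assumption that you must replace with your own free token from the ngrok dashboard, and the exact tunneling setup in the Colab notebook may differ slightly.

[ ]:
# expose the local Argilla port (6900) through an ngrok tunnel to get a public link
from pyngrok import ngrok

# assumption: replace the placeholder with your own ngrok auth token
ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")

# open the tunnel to the Argilla server and print the public URL
tunnel = ngrok.connect(6900)
print("Argilla UI is available at:", tunnel.public_url)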

Log data to Argilla and start your active learning loop with small-text#

If you click on your public link above, you should be able to access Argilla, but no data has been logged to it yet. The following code downloads an example dataset and logs it to Argilla. You can adapt it to download any other dataset you want to annotate. The code follows the active learning with small-text tutorial and therefore contains fewer explanations.

[ ]:
# load dataset
import datasets
dataset_name = "trec"
dataset_hf = datasets.load_dataset(dataset_name, version=datasets.Version("2.0.0"))
# we work with only a sixth of the texts of the dataset for faster testing
dataset_hf["train"] = dataset_hf["train"].shard(num_shards=6, index=0)

[ ]:
## choose the transformer and load tokenizer
import torch
from transformers import AutoTokenizer

# Choose transformer model: in non-GPU environments we use a tiny model to increase efficiency
if not torch.cuda.is_available():
    transformer_model = "prajjwal1/bert-tiny"
    print(f"No GPU is available, we therefore use the small model '{transformer_model}' for the active learning loop.\n")
else:
    transformer_model = "microsoft/deberta-v3-xsmall"  #"bert-base-uncased"
    print(f"A GPU is available, we can therefore use '{transformer_model}' for the active learning loop.\n")

# Init tokenizer
tokenizer = AutoTokenizer.from_pretrained(transformer_model)

[ ]:
## create a small-text TransformersDataset object
import numpy as np
from small_text import TransformersDataset

num_classes = dataset_hf["train"].features["coarse_label"].num_classes
target_labels = np.arange(num_classes)

train_text = [row["text"] for row in dataset_hf["train"]]
train_labels = np.array([row["coarse_label"] for row in dataset_hf["train"]])

# Create the dataset for small-text
dataset_st = TransformersDataset.from_arrays(
    train_text, train_labels, tokenizer, target_labels=target_labels
)

# Create test dataset
test_text = [row["text"] for row in dataset_hf["test"]]
test_labels = np.array([row["coarse_label"] for row in dataset_hf["test"]])

dataset_test = TransformersDataset.from_arrays(
    test_text, test_labels, tokenizer, target_labels=np.arange(num_classes)
)


[ ]:
## setting up the active learner
from small_text import (
    BreakingTies,
    PoolBasedActiveLearner,
    TransformerBasedClassificationFactory,
    TransformerModelArguments,
)

# Define our classifier
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device: ", device)

num_epochs = 5  # higher values of around 40 will probably improve performance on small datasets, but the active learning loop will take longer
clf_factory = TransformerBasedClassificationFactory(
    TransformerModelArguments(transformer_model),
    num_classes=num_classes,
    kwargs={"device": device, "num_epochs": num_epochs, "lr": 2e-05, "mini_batch_size": 8,
            "early_stopping_no_improvement": 5}  # kwargs={"device": "cuda"}
)


# Define our query strategy
query_strategy = BreakingTies()

# Use the active learner with a pool containing all unlabeled data
active_learner = PoolBasedActiveLearner(clf_factory, query_strategy, dataset_st)

[ ]:
## draw an initial sample for the first annotation round
# https://small-text.readthedocs.io/en/v1.1.1/components/initialization.html
from small_text import random_initialization, random_initialization_stratified, random_initialization_balanced
import numpy as np

# Fix seed for reproducibility
np.random.seed(42)

# Number of samples in our queried batches
NUM_SAMPLES = 10

# Draw an initial subset from the data pool
#initial_indices = random_initialization(dataset_st, NUM_SAMPLES)
#initial_indices = random_initialization_balanced(train_labels, NUM_SAMPLES)
initial_indices = random_initialization_stratified(train_labels, NUM_SAMPLES)

[ ]:
### log the first data to Argilla
import argilla as rg

# Choose a name for the dataset
DATASET_NAME = f"{dataset_name}_with_active_learning"

# Define labeling schema
labels = dataset_hf["train"].features["coarse_label"].names
settings = rg.TextClassificationSettings(label_schema=labels)

# Create dataset with a label schema
rg.configure_dataset(name=DATASET_NAME, settings=settings)

# Create records from the initial batch
records = [
    rg.TextClassificationRecord(
        text=dataset_hf["train"]["text"][idx],
        metadata={"batch_id": 0},
        id=idx,
    )
    for idx in initial_indices
]

# Log initial records to Argilla
rg.log(records, DATASET_NAME)

[ ]:
### create active learning loop
from argilla.listeners import listener
from sklearn.metrics import accuracy_score

# Define some helper variables
LABEL2INT = dataset_hf["train"].features["coarse_label"].str2int
ACCURACIES = []

# Set up the active learning loop with the listener decorator
@listener(
    dataset=DATASET_NAME,
    query="status:Validated AND metadata.batch_id:{batch_id}",
    condition=lambda search: search.total == NUM_SAMPLES,
    execution_interval_in_seconds=3,
    batch_id=0,
)
def active_learning_loop(records, ctx):
    # 1. Update active learner
    print(f"Updating with batch_id {ctx.query_params['batch_id']} ...")
    y = np.array([LABEL2INT(rec.annotation) for rec in records])

    # initial update
    if ctx.query_params["batch_id"] == 0:
        indices = np.array([rec.id for rec in records])
        active_learner.initialize_data(indices, y)
    # update with the prior queried indices
    else:
        active_learner.update(y)
    print("Done!")

    # 2. Query active learner
    print("Querying new data points ...")
    queried_indices = active_learner.query(num_samples=NUM_SAMPLES)
    new_batch = ctx.query_params["batch_id"] + 1
    new_records = [
        rg.TextClassificationRecord(
            text=dataset_hf["train"]["text"][idx],
            metadata={"batch_id": new_batch},
            id=idx,
        )
        for idx in queried_indices
    ]

    # 3. Log the batch to Argilla
    rg.log(new_records, DATASET_NAME)

    # 4. Evaluate current classifier on the test set
    print("Evaluating current classifier ...")
    accuracy = accuracy_score(
        dataset_test.y,
        active_learner.classifier.predict(dataset_test),
    )

    ACCURACIES.append(accuracy)
    ctx.query_params["batch_id"] = new_batch
    print("Done!")

    print("Waiting for annotations ...")



active_learning_loop.start()
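
The listener now runs in the background: every time you validate NUM_SAMPLES records of the current batch in the Argilla UI, it updates the learner, queries a new batch, and logs it for annotation. Once you are done annotating, you can stop the loop and inspect how the test accuracy evolved across batches. The sketch below assumes the listener's stop() method and uses the ACCURACIES list defined above.

[ ]:
# stop the background listener once you have annotated enough batches
active_learning_loop.stop()

# print the test accuracy recorded after each completed annotation batch
for batch_id, accuracy in enumerate(ACCURACIES):
    print(f"Batch {batch_id}: test accuracy = {accuracy:.3f}")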

Extract annotated data for downstream use#

[ ]:
## https://docs.argilla.io/en/latest/getting_started/quickstart.html#Manual-extraction
import pandas as pd

# load your annotations
dataset_annotated = rg.load(DATASET_NAME)
# convert to Hugging Face dataset format
dataset_annotated = dataset_annotated.prepare_for_training()
# now you can write your annotations to .csv, use them for training etc.
df_annotations = pd.DataFrame(dataset_annotated)
df_annotations.head()
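
From here you can persist the annotations or feed them straight into a training pipeline. The cell below is one possible continuation, not part of the original tutorial: it assumes you simply want a CSV file (the file name is arbitrary) and a train/test split of the prepared Hugging Face dataset.

[ ]:
# write the annotated records to a CSV file for later use
df_annotations.to_csv("trec_annotations.csv", index=False)

# or split the prepared Hugging Face dataset for fine-tuning a downstream model
dataset_split = dataset_annotated.train_test_split(test_size=0.2, seed=42)
print(dataset_split)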

Summary#

In this tutorial, we saw how you can embed Argilla in an active learning loop on a free GPU in Google Colab. We relied on small-text to use a Hugging Face transformer within an active learning setup. In the end, we gathered a sample-efficient dataset by annotating only the most informative records for the model.

Argilla makes it very easy to use a dedicated annotation team or subject matter experts as an oracle for your active learning system. They will only interact with the Argilla UI and do not have to worry about training or querying the system. We encourage you to try out active learning in your next project and make your and your annotators' lives a little easier.

Next steps#

โญ Argilla Github repo to stay updated.

๐Ÿ“š Argilla documentation for more guides and tutorials.

๐Ÿ™‹โ€โ™€๏ธ Join the Argilla community! A good place to start is our slack channel.