Log, load, and prepare data#
This guide showcases some features of the Dataset classes in the Argilla client. The Dataset classes are lightweight containers for Argilla records. These classes facilitate importing from and exporting to different formats (e.g., pandas.DataFrame, datasets.Dataset) as well as sharing and versioning Argilla datasets using the Hugging Face Hub.
For each record type there's a corresponding Dataset class called DatasetFor<RecordType>. You can look up their API in the reference section.
Create a Dataset#
The main component of the Argilla data model is called a record. A dataset in Argilla is a collection of these records. Records can be of different types depending on the currently supported tasks:
TextClassificationRecord
TokenClassificationRecord
Text2TextRecord
The most critical attributes of a record that are common to all types are:
text: The input text of the record (Required);
annotation: Annotate your record in a task-specific manner (Optional);
prediction: Add task-specific model predictions to the record (Optional);
metadata: Add some arbitrary metadata to the record (Optional).
In Argilla, records are created programmatically using the client library within a Python script, a Jupyter notebook, or another IDE.
Let's see how to create and upload a basic record to the Argilla web app (make sure Argilla is already installed on your machine as described in the setup guide).
Under the hood, the Dataset classes store the records in a simple Python list. Therefore, working with a Dataset class is not very different from working with a simple list of records:
[ ]:
import argilla as rg

# Create a TextClassificationRecord
textcat_record = rg.TextClassificationRecord(
    text="Hello world, this is me!",
    prediction=[("LABEL1", 0.8), ("LABEL2", 0.2)],
    annotation="LABEL1",
    multi_label=False,
)

# Create a TokenClassificationRecord
tokencat_record = rg.TokenClassificationRecord(
    text="Michael is a professor at Harvard",
    tokens=["Michael", "is", "a", "professor", "at", "Harvard"],
    prediction=[("NAME", 0, 7), ("LOC", 26, 33)],
)

# Create a Text2TextRecord
text2text_record = rg.Text2TextRecord(
    text="My name is Sarah and I love my dog.",
    prediction=["Je m'appelle Sarah et j'aime mon chien."],
)

# Start with a list of Argilla records
# Note that each dataset can only contain records of the same type (e.g. TextClassificationRecord)
dataset_rg = rg.DatasetForTextClassification([textcat_record])

# Loop over the dataset
for record in dataset_rg:
    print(record)

# Index into the dataset
dataset_rg[0] = rg.TextClassificationRecord(text="replace record")

# Log a dataset to the Argilla web app
rg.log(
    records=dataset_rg,
    name="my_dataset",
    tags={"example": "test"},
    background=True,
    verbose=False,
)
The Dataset classes do some extra checks for you, to make sure you do not mix record types when appending or indexing into a dataset.
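The kind of check described above can be pictured with a small plain-Python sketch. The `TypedRecordList` class and the two record dataclasses are hypothetical stand-ins written for illustration; they are not Argilla's implementation and only show the idea of rejecting a mismatched record type on append or index assignment.

```python
from dataclasses import dataclass


@dataclass
class TextRecord:
    """Hypothetical stand-in for a text classification record."""
    text: str


@dataclass
class TokenRecord:
    """Hypothetical stand-in for a token classification record."""
    text: str
    tokens: list


class TypedRecordList:
    """List-like container that only accepts one record type (illustrative only)."""

    def __init__(self, record_type, records=None):
        self._record_type = record_type
        self._records = []
        for record in records or []:
            self.append(record)

    def _check(self, record):
        if not isinstance(record, self._record_type):
            raise TypeError(
                f"Expected {self._record_type.__name__}, got {type(record).__name__}"
            )

    def append(self, record):
        self._check(record)
        self._records.append(record)

    def __setitem__(self, index, record):
        self._check(record)
        self._records[index] = record

    def __getitem__(self, index):
        return self._records[index]

    def __iter__(self):
        return iter(self._records)

    def __len__(self):
        return len(self._records)


dataset = TypedRecordList(TextRecord, [TextRecord(text="first")])
dataset[0] = TextRecord(text="replaced")

try:
    # Appending a different record type is rejected
    dataset.append(TokenRecord(text="a b", tokens=["a", "b"]))
    mixed_type_rejected = False
except TypeError:
    mixed_type_rejected = True
```

Checking in both `append` and `__setitem__` keeps the underlying list homogeneous regardless of how records enter it.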
Logging for deployments#
Argilla currently gives users several ways to log model predictions besides the rg.log async method.
Log with rg.monitor#
For widely used libraries, Argilla includes an "auto-monitoring" option via the rg.monitor method. Currently supported libraries are Hugging Face Transformers and spaCy. If you'd like to see another library supported, feel free to add a discussion or issue on GitHub.
rg.monitor will wrap HF and spaCy pipelines so that every time you call them, the output of these calls is logged into the dataset of your choice as a non-blocking background process. Additionally, rg.monitor will add several tags to your dataset, such as the library build version, the model name, and the language. This should also work for custom (private) pipelines, not only the Hub's or official spaCy models.
It is worth noting that this feature is useful beyond monitoring, and can be used for data collection (e.g., bootstrapping data annotation with pre-trained pipelines), model development (e.g., error analysis), and model evaluation (e.g., combined with data annotation to obtain evaluation metrics).
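To make the "wrap the pipeline and log in the background" idea concrete, here is a minimal standard-library sketch. The names `monitor_sketch`, `sink`, and `flush` are hypothetical and this is not how rg.monitor is implemented; it only illustrates a wrapper that returns the prediction immediately while a background thread records a sampled share of the calls.

```python
import queue
import random
import threading


def monitor_sketch(predict_fn, sink, sample_rate=1.0):
    """Wrap predict_fn so that sampled calls are logged by a background thread."""
    pending = queue.Queue()

    def worker():
        # Runs in the background and moves queued items into the sink
        while True:
            item = pending.get()
            sink.append(item)
            pending.task_done()

    threading.Thread(target=worker, daemon=True).start()

    def wrapped(text):
        output = predict_fn(text)
        # Log only a fraction of the calls, controlled by sample_rate
        if random.random() < sample_rate:
            pending.put({"text": text, "prediction": output})
        return output

    # Expose a way to wait until everything queued so far has been logged
    wrapped.flush = pending.join
    return wrapped


logged = []
classify = monitor_sketch(
    lambda text: "positive" if "good" in text else "negative", logged
)
result = classify("a good day")
classify.flush()  # only needed here to make the example deterministic
```

The caller never waits on the logging step, which is the property that matters when the wrapped pipeline serves live traffic.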
Using spaCy#
[ ]:
import spacy
import argilla as rg

nlp = spacy.load("en_core_web_sm")
nlp = rg.monitor(nlp, dataset="nlp_monitoring_spacy")

# `dataset` is assumed to be an existing datasets.Dataset with a "text" column
dataset.map(lambda example: {"prediction": nlp(example["text"])})
Using transformers#
[ ]:
from transformers import pipeline
import argilla as rg

nlp = pipeline(
    "sentiment-analysis", return_all_scores=True, padding=True, truncation=True
)
nlp = rg.monitor(nlp, dataset="nlp_monitoring")

# `dataset` is assumed to be an existing datasets.Dataset with a "text" column
dataset.map(lambda example: {"prediction": nlp(example["text"])})
Using flAIr#
[ ]:
import argilla as rg
from flair.data import Sentence
from flair.models import SequenceTagger

# load tagger
tagger = rg.monitor(
    SequenceTagger.load("flair/ner-english"), dataset="flair-example", sample_rate=1.0
)

# make example sentence
sentence = Sentence("George Washington went to Washington")

# predict NER tags. This will log the prediction in Argilla
tagger.predict(sentence)
Log using ASGI middleware#
For using the ASGI middleware, see this tutorial.
Load a Dataset#
You can load a dataset with rg.load. For more information, check our query page on querying and our vector page on vector search.
[ ]:
import argilla as rg

rg.load(
    name="my_dataset",
    query="my AND query",
    limit=42,
    ids=["id1", "id2", "id3"],
    vectors=["vector1", "vector2", "vector3"],
)
Update a Dataset#
Argilla datasets have certain settings that you can configure via the rg.*Settings classes, for example rg.TextClassificationSettings.
Define a labeling schema#
You can define a labeling schema for your Argilla dataset, which fixes the allowed labels for your predictions and annotations. Once you set a labeling schema, each time you log to the corresponding dataset, Argilla will perform validations of the added predictions and annotations to make sure they comply with the schema.
[ ]:
import argilla as rg
# Define labeling schema
settings = rg.TextClassificationSettings(label_schema=["A", "B", "C"])
# Apply settings to a new or already existing dataset
rg.configure_dataset(name="my_dataset", settings=settings)
# Logging to the newly created dataset triggers the validation checks
rg.log(rg.TextClassificationRecord(text="text", annotation="D"), "my_dataset")
# BadRequestApiError: Argilla server returned an error with http status: 400
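The validation step can be thought of as a set-membership check on every label a record carries. The sketch below uses plain dicts and a hypothetical `validate_labels` helper; it is an illustration of the idea under those assumptions, not the check the Argilla server actually runs.

```python
def validate_labels(record, label_schema):
    """Raise ValueError if a record uses a label outside the schema."""
    allowed = set(label_schema)
    used = set()
    if record.get("annotation") is not None:
        used.add(record["annotation"])
    # Predictions are (label, score) pairs
    for label, _score in record.get("prediction") or []:
        used.add(label)
    unknown = sorted(used - allowed)
    if unknown:
        raise ValueError(f"Labels not in schema: {unknown}")


schema = ["A", "B", "C"]

# A record whose labels are all in the schema passes silently
validate_labels({"text": "ok", "annotation": "A", "prediction": [("B", 0.9)]}, schema)

# A record annotated with "D" is rejected
try:
    validate_labels({"text": "text", "annotation": "D"}, schema)
    error_message = None
except ValueError as error:
    error_message = str(error)
```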
Update Records#
It is possible to update records in your Argilla datasets using our Python API. This approach works the same way as an upsert in a normal database, based on the record id. You can update any arbitrary parameters, and they will be overwritten if you use the id of the original record.
[ ]:
import argilla as rg
# read all records in the dataset or define a specific search via the `query` parameter
record = rg.load("my_first_dataset")
# modify first record metadata (if no previous metadata dict you might need to create it)
record[0].metadata["my_metadata"] = "im a new value"
# log record to update it, this will keep everything but add my_metadata field and value
rg.log(name="my_first_dataset", records=record[0])
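The upsert behaviour can be illustrated with a plain dictionary keyed by record id. The `upsert` helper and the dict-based records below are hypothetical; the sketch only shows why logging a record with an existing id replaces it instead of creating a duplicate.

```python
def upsert(store, records):
    """Insert records into store; a record with an existing id replaces the old one."""
    for record in records:
        store[record["id"]] = record
    return store


store = {}
upsert(
    store,
    [
        {"id": 1, "text": "first", "metadata": {}},
        {"id": 2, "text": "second", "metadata": {}},
    ],
)

# Log record 1 again with extra metadata: same id, so it overwrites the old entry
updated = dict(store[1], metadata={"my_metadata": "im a new value"})
upsert(store, [updated])
```

After the second call the store still holds two records, and only the one with id 1 has changed.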
Import a Dataset#
When you have your data in a pandas DataFrame or a datasets Dataset, we provide some neat shortcuts to import this data into an Argilla Dataset. You have to make sure that the data follows the record model of a specific task, otherwise you will get validation errors. Columns in your DataFrame/Dataset that are not supported or recognized will simply be ignored.
The record models of the tasks are explained in the reference section.
Note
Due to its pyarrow nature, data in a datasets.Dataset has to follow a slightly different model that you can look up in the examples of the Dataset*.from_datasets docstrings.
[ ]:
import argilla as rg
# import data from a pandas DataFrame
dataset_rg = rg.read_pandas(my_dataframe, task="TextClassification")
# or
dataset_rg = rg.DatasetForTextClassification.from_pandas(my_dataframe)
# import data from a datasets Dataset
dataset_rg = rg.read_datasets(my_dataset, task="TextClassification")
# or
dataset_rg = rg.DatasetForTextClassification.from_datasets(my_dataset)
We also provide helper arguments you can use to read almost arbitrary datasets for a given task from the Hugging Face Hub. They map certain input arguments of the Argilla records to columns of the given dataset. Let's have a look at a few examples:
[ ]:
import argilla as rg
from datasets import load_dataset
# the "poem_sentiment" dataset has columns "verse_text" and "label"
dataset_rg = rg.DatasetForTextClassification.from_datasets(
    dataset=load_dataset("poem_sentiment", split="test"),
    text="verse_text",
    annotation="label",
)

# the "snli" dataset has the columns "premise", "hypothesis" and "label"
dataset_rg = rg.DatasetForTextClassification.from_datasets(
    dataset=load_dataset("snli", split="test"),
    inputs=["premise", "hypothesis"],
    annotation="label",
)

# the "conll2003" dataset has the columns "id", "tokens", "pos_tags", "chunk_tags" and "ner_tags"
rg.DatasetForTokenClassification.from_datasets(
    dataset=load_dataset("conll2003", split="test"),
    tags="ner_tags",
)

# the "xsum" dataset has the columns "id", "document" and "summary"
rg.DatasetForText2Text.from_datasets(
    dataset=load_dataset("xsum", split="test"),
    text="document",
    annotation="summary",
)
You can also use the shortcut rg.read_datasets(dataset=..., task=..., **kwargs), where the keyword arguments are passed on to the corresponding from_datasets() method.
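The column mapping performed by these helper arguments can be sketched in plain Python. The `rows_to_records` function, the dict-based records, and the sample rows are all made up for illustration; the sketch only shows how `text`, `inputs`, and `annotation` arguments can name the columns that feed each record field.

```python
def rows_to_records(rows, text=None, inputs=None, annotation=None):
    """Map dataset columns to record fields (plain dicts stand in for records)."""
    records = []
    for row in rows:
        record = {}
        if text is not None:
            record["text"] = row[text]
        if inputs is not None:
            # Several columns can be combined into one multi-field input
            record["inputs"] = {column: row[column] for column in inputs}
        if annotation is not None:
            record["annotation"] = row[annotation]
        records.append(record)
    return records


# Made-up rows shaped like a single-text dataset and a sentence-pair dataset
poem_rows = [{"verse_text": "a made-up verse", "label": 1}]
pair_rows = [{"premise": "made-up premise", "hypothesis": "made-up hypothesis", "label": 0}]

poem_records = rows_to_records(poem_rows, text="verse_text", annotation="label")
pair_records = rows_to_records(pair_rows, inputs=["premise", "hypothesis"], annotation="label")
```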
Reindex a Dataset#
Sometimes updates require us to reindex the data.
Argilla Metrics#
For our internally computed metrics, this can be done by simply loading and logging the same records back to the same index. This is because our internal metrics are computed and updated during logging.
[ ]:
import argilla as rg
dataset = "my-outdated-dataset"
ds = rg.load(dataset)
rg.log(ds, dataset)
Elasticsearch#
For Elastic indices, re-indexing requires a bit more effort. To be certain of a proper re-indexing, we require loading the records and storing them in a completely new index.
[ ]:
import argilla as rg
dataset = "my-outdated-dataset"
ds = rg.load(dataset)
new_dataset = "my-new-dataset"
rg.log(ds, new_dataset)
Prepare dataset for training#
If you want to train a Hugging Face transformer or a spaCy NER pipeline, we provide a handy method to prepare your dataset: DatasetFor*.prepare_for_training(). It will return a Hugging Face dataset, a spaCy DocBin, or a SparkNLP-formatted DataFrame, optimized for the training process with the Hugging Face Trainer, the spaCy CLI, or the SparkNLP API. Our libraries deep dive and training tutorials show entire training workflows for your favorite packages.
Train-test split#
You can include a train-test split directly in prepare_for_training by passing the train_size and test_size parameters.
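As a rough picture of what a fractional train-test split does, the sketch below shuffles the records with a fixed seed and cuts the list at the train fraction. The `split_records` helper and its `seed` argument are hypothetical, and the sketch assumes the test set is simply the remainder; how prepare_for_training handles the two sizes internally may differ.

```python
import random


def split_records(records, train_size, seed=42):
    """Shuffle a copy of records and split it by a train fraction."""
    if not 0 < train_size < 1:
        raise ValueError("train_size must be between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


train, test = split_records(range(10), train_size=0.8)
```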
TextClassification#
For text classification tasks, it flattens the inputs into separate columns of the returned dataset and converts the annotations of your records into integers and writes them in a label column:
[ ]:
import argilla as rg
dataset_rg = rg.DatasetForTextClassification(
    [
        rg.TextClassificationRecord(
            inputs={"title": "My title", "content": "My content"}, annotation="news"
        )
    ]
)
dataset_rg.prepare_for_training(framework="transformers")[0]
# Output:
# {'title': 'My title', 'content': 'My content', 'label': 0}
import spacy
nlp = spacy.blank("en")
dataset_rg.prepare_for_training(framework="spacy", lang=nlp)
# Output:
# <spacy.tokens._serialize.DocBin object at 0x280613af0>
dataset_rg.prepare_for_training(framework="spark-nlp")
# Output:
# <pd.DataFrame>
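The conversion of string annotations into integer labels can be sketched as follows. The `encode_labels` helper is hypothetical, and assigning ids by sorted label name is an assumption made for this example rather than a statement about the order Argilla uses.

```python
def encode_labels(annotations):
    """Map string labels to integer ids, using sorted label order."""
    label_names = sorted(set(annotations))
    label_to_id = {name: index for index, name in enumerate(label_names)}
    return [label_to_id[label] for label in annotations], label_names


label_ids, label_names = encode_labels(["news", "sports", "news"])
```

Keeping `label_names` alongside the ids makes it possible to translate model outputs back into readable labels later.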
TokenClassification#
For token classification tasks, it converts the annotations of a record into integers representing BIO tags and writes them in a ner_tags column. You select the output format by passing the framework argument, for example transformers or spacy.
[ ]:
import argilla as rg
dataset_rg = rg.DatasetForTokenClassification(
    [
        rg.TokenClassificationRecord(
            text="I live in Madrid",
            tokens=["I", "live", "in", "Madrid"],
            annotation=[("LOC", 10, 16)],
        )
    ]
)
dataset_rg.prepare_for_training(framework="transformers")[0]
# Output:
# {..., 'tokens': ['I', 'live', 'in', 'Madrid'], 'ner_tags': [0, 0, 0, 1], ...}
import spacy
nlp = spacy.blank("en")
dataset_rg.prepare_for_training(framework="spacy", lang=nlp)
# Output:
# <spacy.tokens._serialize.DocBin object at 0x280613af0>
dataset_rg.prepare_for_training(framework="spark-nlp")
# Output:
# <pd.DataFrame>
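To show how character-level span annotations turn into per-token BIO tags, here is a minimal sketch. The helpers `spans_to_bio` and `tags_to_ids` are hypothetical, and the id layout (O first, then a B-/I- pair per label) is an assumption chosen for this example; the sketch also assumes each token appears in the text in order and that entity spans align with token boundaries.

```python
def spans_to_bio(text, tokens, spans):
    """Convert character-level entity spans to one BIO tag per token."""
    tags = []
    cursor = 0
    for token in tokens:
        # Locate the token in the text, searching from the end of the previous one
        start = text.index(token, cursor)
        end = start + len(token)
        cursor = end
        tag = "O"
        for label, span_start, span_end in spans:
            if start >= span_start and end <= span_end:
                # The first token of a span gets B-, the following ones I-
                prefix = "B" if start == span_start else "I"
                tag = f"{prefix}-{label}"
                break
        tags.append(tag)
    return tags


def tags_to_ids(tags, labels):
    """Encode tags as integers: O is 0, then a B-/I- pair for each label."""
    names = ["O"]
    for label in labels:
        names += [f"B-{label}", f"I-{label}"]
    return [names.index(tag) for tag in tags]


text = "I live in Madrid"
tokens = ["I", "live", "in", "Madrid"]
tags = spans_to_bio(text, tokens, [("LOC", 10, 16)])
ner_tags = tags_to_ids(tags, ["LOC"])
```

For the record used in this section, the sketch tags only the last token as the beginning of a LOC entity.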
Next steps#
If you want to continue learning Argilla:
Join the Argilla Slack community!
Argilla GitHub repo to stay updated.
Argilla documentation for more guides and tutorials.