💫 Explore and analyze spaCy NER predictions
In this tutorial, we will learn to log spaCy Named Entity Recognition (NER) predictions.
This is useful for:
🧐Evaluating pre-trained models.
🔎Spotting frequent errors both during development and production.
📈Improving your pipelines over time using Argilla annotation mode.
🎮Monitoring your model predictions using Argilla's integration with Kibana.
Let’s get started!
In this tutorial we will learn how to explore and analyze spaCy NER pipelines in an easy way.
We will load the Gutenberg Time dataset from the Hugging Face Hub, use a transformer-based spaCy model to detect entities in it, and log the detected entities into an Argilla dataset. This dataset can be used for exploring the quality of predictions and for creating a new training set by correcting, adding and validating entities.
Then, we will use a smaller spaCy model to detect entities and log them into the same Argilla dataset, so that we can compare its predictions with those of the previous model. And, as a bonus, we will use Argilla and spaCy on a more challenging dataset: IMDB.
For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:
Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:
For details about configuring your deployment, check the official Hugging Face Hub guide.
Launch Argilla using Argilla’s quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.
For more information on deployment options, please check the Deployment section of the documentation.
This tutorial is a Jupyter Notebook. There are two options to run it:
Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don’t forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.
%pip install argilla -qqq
%pip install torch -qqq
%pip install datasets "spacy[transformers]~=3.0" protobuf -qqq
!python -m spacy download en_core_web_trf
!python -m spacy download en_core_web_sm
Let’s import the Argilla module for reading and writing data:
import argilla as rg
If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to init the Argilla client with the api_url and api_key:
# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
rg.init(
    api_url="http://localhost:6900",
    api_key="team.apikey"
)
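If you switch between a local Docker deployment and a Space, it can be handy to keep the connection values out of the notebook. The following is a minimal sketch of that idea, not part of the original tutorial: the helper `argilla_connection_settings` is hypothetical, and the environment variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are assumed here for illustration. The fallback values are the ones used in the cell above.

```python
import os


def argilla_connection_settings(env=None):
    """Build keyword arguments for rg.init (hypothetical helper).

    Values are read from the given mapping (os.environ by default) and fall
    back to the quickstart defaults used in this tutorial.
    """
    env = os.environ if env is None else env
    return {
        # Assumed variable names, chosen for this sketch
        "api_url": env.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": env.get("ARGILLA_API_KEY", "team.apikey"),
    }


settings = argilla_connection_settings()
# With Argilla imported as rg, the client could then be initialised with:
# rg.init(**settings)
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out without touching your shell configuration.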
Finally, let’s include the imports we need:
from datasets import load_dataset
import pandas as pd
import spacy
from tqdm.auto import tqdm
If you want to skip running the spaCy pipelines, you can also load the resulting Argilla records directly from the Hugging Face Hub, and continue the tutorial logging them to the Argilla web app. For example:
records = rg.read_datasets(
    load_dataset("argilla/gutenberg_spacy_ner", split="train"),
    task="TokenClassification",
)
The Argilla records of this tutorial are available under the names “argilla/gutenberg_spacy_ner” and “argilla/imdb_spacy_ner”.
For this tutorial, we’re going to use the Gutenberg Time dataset from the Hugging Face Hub. It contains all explicit time references in a dataset of 52,183 novels whose full text is available via Project Gutenberg. Since it consists of extracts from novels, we can expect to find plenty of named entities.
dataset = load_dataset("gutenberg_time", split="train", streaming=True)

# Let's have a look at the first 5 examples of the train set.
pd.DataFrame(dataset.take(5))
| | guten_id | hour_reference | time_phrase | is_ambiguous | time_pos_start | time_pos_end | tok_context |
|0|4447|5|five o'clock|True|145|147|I crossed the ground she had traversed , notin...|
|1|4447|12|the fall of the winter noon|True|68|74|So profoundly penetrated with thoughtfulness w...|
|2|28999|12|midday|True|46|47|And here is Hendon , and it is time for us to ...|
|3|28999|12|midday|True|133|134|Sorrows and trials she had had in plenty in he...|
|4|28999|0|midnight|True|43|44|Jeannie joined her friend in the window-seat ....|
Logging spaCy NER entities into Argilla
Let’s instantiate a spaCy transformer nlp pipeline and apply it to the first 50 examples in our dataset, collecting the tokens and NER entities.
nlp = spacy.load("en_core_web_trf")

# Creating an empty record list to save all the records
records = []

# Iterate over the first 50 examples of the Gutenberg dataset
for record in tqdm(list(dataset.take(50))):
    # We only need the text of each instance
    text = record["tok_context"]

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations as (label, start character, end character)
    entities = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Argilla TokenClassificationRecord list
    records.append(
        rg.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_trf",
        )
    )

rg.log(records=records, name="gutenberg_spacy_ner")
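Each record above carries both a token list and entities given as character offsets, so it can be useful to sanity-check that every entity starts and ends on a token boundary before logging. The sketch below shows one way to do that in plain Python. The helpers `token_offsets` and `misaligned_entities` are hypothetical names introduced here, and the sample sentence and spans are made up for illustration (they are not real model output).

```python
def token_offsets(text, tokens):
    """Locate each token in the text, scanning left to right."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.index(token, cursor)
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def misaligned_entities(text, tokens, entities):
    """Return the (label, start, end) entities not aligned with token boundaries."""
    offsets = token_offsets(text, tokens)
    starts = {start for start, _ in offsets}
    ends = {end for _, end in offsets}
    return [ent for ent in entities if ent[1] not in starts or ent[2] not in ends]


# Made-up example in the same shape as the tutorial's records
text = "Jeannie arrived at five o'clock ."
tokens = ["Jeannie", "arrived", "at", "five", "o'clock", "."]
entities = [("PERSON", 0, 7), ("TIME", 19, 31)]

print(misaligned_entities(text, tokens, entities))
```

When the tokens and entities come from the same spaCy Doc, as in this tutorial, the check should come back empty. It is more likely to flag something when tokens and spans come from different sources.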
If you go to the gutenberg_spacy_ner dataset in Argilla, you can explore the predictions of this model. You can:
Filter records containing specific entity types,
See the most frequent “mentions” or surface forms for each entity. Mentions are the string values of specific entity types; for example, “1 month” can be the mention of a duration entity. This is useful for error analysis, to quickly spot potential issues and problematic entity types,
Use the free-text search to find records containing specific words,
And validate, include or reject specific entity annotations to build a new training set.
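To make the idea of "mentions" concrete, here is a small sketch that counts surface forms per entity label outside the UI. It assumes entities in the same (label, start_char, end_char) shape used in this tutorial. The function `mention_counts` is a hypothetical helper, and the two sample sentences are made up for illustration.

```python
from collections import Counter, defaultdict


def mention_counts(records):
    """Count mentions per label from (text, entities) pairs."""
    counts = defaultdict(Counter)
    for text, entities in records:
        for label, start, end in entities:
            # The mention is the slice of text covered by the entity span
            counts[label][text[start:end]] += 1
    return counts


# Made-up sample data, not real model predictions
sample = [
    ("We met at midday .", [("TIME", 10, 16)]),
    ("It was midday when Jeannie left .", [("TIME", 7, 13), ("PERSON", 19, 26)]),
]

counts = mention_counts(sample)
print(counts["TIME"].most_common(3))
```

Sorting mentions by frequency for each label is a quick way to surface systematic errors, since a wrongly labelled frequent mention tends to show up near the top.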
Now let’s compare with a smaller, but more efficient pre-trained model.
nlp = spacy.load("en_core_web_sm")

# Creating an empty record list to save all the records
records = []

# Iterate over 10000 examples of the Gutenberg dataset
for record in tqdm(list(dataset.take(10000))):
    # We only need the text of each instance
    text = record["tok_context"]

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations as (label, start character, end character)
    entities = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Argilla TokenClassificationRecord list
    records.append(
        rg.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_sm",
        )
    )

rg.log(records=records, name="gutenberg_spacy_ner")
Exploring and comparing
If you go to your gutenberg_spacy_ner dataset, you can explore and compare the results of both models.
To only see predictions of a specific model, you can use the predicted by filter, which comes from the prediction_agent parameter of your TokenClassificationRecord.
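Besides filtering in the UI, the two models can also be compared programmatically. The sketch below assumes you have the entity tuples predicted by each model for the same text. The function `compare_predictions` is a hypothetical helper, and the example spans are made up to illustrate a boundary disagreement (they are not real model output).

```python
def compare_predictions(entities_a, entities_b):
    """Split two lists of (label, start, end) entities into shared and unique ones."""
    set_a, set_b = set(entities_a), set(entities_b)

    def by_position(entity):
        # Sort by start offset, then end offset, then label
        return (entity[1], entity[2], entity[0])

    return {
        "both": sorted(set_a & set_b, key=by_position),
        "only_a": sorted(set_a - set_b, key=by_position),
        "only_b": sorted(set_b - set_a, key=by_position),
    }


# Made-up predictions for one text from two different pipelines
trf_entities = [("PERSON", 0, 7), ("TIME", 19, 31)]
sm_entities = [("PERSON", 0, 7), ("TIME", 24, 31)]

diff = compare_predictions(trf_entities, sm_entities)
print(diff)
```

This uses exact matching on label and offsets, so a span that differs by a single character counts as a disagreement. A more lenient overlap-based comparison would be a reasonable alternative.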
Explore the IMDB dataset
So far, both spaCy pretrained models seem to work pretty well. Let’s try a more challenging dataset, one that differs more from the data these models were trained on.
imdb = load_dataset("imdb", split="test")

records = []
for record in tqdm(imdb.select(range(5000))):
    # We only need the text of each instance
    text = record["text"]

    # spaCy Doc creation
    doc = nlp(text)

    # Entity annotations as (label, start character, end character)
    entities = [(ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]

    # Pre-tokenized input text
    tokens = [token.text for token in doc]

    # Argilla TokenClassificationRecord list
    records.append(
        rg.TokenClassificationRecord(
            text=text,
            tokens=tokens,
            prediction=entities,
            prediction_agent="en_core_web_sm"
        )
    )

rg.log(records=records, name="imdb_spacy_ner")
Exploring this dataset highlights the need for fine-tuning on specific domains.
For example, if we check the most frequent mentions for Person, we find two highly frequent misclassified entities: gore (the film genre) and Oscar (the prize).
You can easily check every example by using the filters and search-box.
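The same kind of check can be approximated in code by pulling the surrounding context for a given label and mention. This is a rough sketch assuming entities in the (label, start_char, end_char) shape used above. The function `contexts_for_mention` and its `window` parameter are hypothetical, and the sample review sentence is made up.

```python
def contexts_for_mention(records, label, mention, window=20):
    """Return text snippets around entities with the given label and mention."""
    hits = []
    for text, entities in records:
        for ent_label, start, end in entities:
            # Case-insensitive match on the surface form
            if ent_label == label and text[start:end].lower() == mention.lower():
                hits.append(text[max(0, start - window):end + window])
    return hits


# Made-up example of a prize name tagged as a person
sample = [
    ("He won an Oscar for the role.", [("PERSON", 10, 15)]),
    ("The gore was excessive.", []),
]

for snippet in contexts_for_mention(sample, "PERSON", "oscar"):
    print(snippet)
```

Reading a handful of snippets per suspicious mention is usually enough to tell whether the label is a systematic error or an occasional one.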
In this tutorial, you learned how to log and explore different spaCy NER models with Argilla. Now you can:
Build custom dashboards using Kibana to monitor and visualize spaCy models.
Build training sets using pre-trained spaCy models.
Appendix: Log datasets to the Hugging Face Hub
Here we will show you an example of how you can push an Argilla dataset (records) to the Hugging Face Hub. In this way you can effectively version any of your Argilla datasets.
records = rg.load("gutenberg_spacy_ner")
records.to_datasets().push_to_hub("<name of the dataset on the HF Hub>")
If you want to continue learning Argilla:
🙋♀️ Join the Argilla Slack community!
⭐ Star the Argilla GitHub repo to stay updated.
📚 Argilla documentation for more guides and tutorials.