📑 Making the Most of Markdown within Argilla TextFields#

As you may have noticed, Argilla supports Markdown within its text fields. This means you can easily add formatting like bold, italic, or highlighted text and links, and even insert HTML elements like images, audio, videos, and iframes. It’s a powerful tool to have at your disposal. Let’s dive in!

In this notebook, we will go over the basics of Markdown and how to use it within Argilla by:

Exploiting the power of displaCy for NER and relationship extraction.
Exploring multi-modality: video, audio, and image.
Inspecting PDFs.

Let’s get started!

Friendly Reminder: Multimedia in Markdown is here, but it’s still in the experimental phase. At this early stage, there are limits on file sizes due to Elasticsearch constraints, and the visualization and loading times may vary depending on your browser. We’re working to improve this and welcome your feedback and suggestions! 🌟🚀

Running Argilla#

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks.

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla’s quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don’t forget to change the runtime type to GPU for faster model training and inference.
Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter Notebook tool of your choice.

Set up the Environment#

To complete this tutorial, you will need to install the Argilla client and a few third-party libraries using pip:

[ ]:

# %pip install --upgrade pip
%pip install argilla
%pip install datasets
%pip install spacy spacy-transformers
%pip install Pillow
%pip install span_marker
%pip install soundfile librosa
!python -m spacy download en_core_web_sm

Let’s make the needed imports:

[ ]:

import argilla as rg
from argilla.client.feedback.utils import audio_to_html, image_to_html, video_to_html, pdf_to_html

import re
import os
import pandas as pd
import span_marker
import tarfile
import glob
import subprocess
import random

import spacy
from spacy import displacy

from datasets import load_dataset

from huggingface_hub import hf_hub_download

If you are running Argilla using the Docker quickstart image or Hugging Face Spaces, you need to initialize the Argilla client with the api_url and api_key:

[ ]:

# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# Replace workspace with the name of your workspace
rg.init(
    api_url="http://localhost:6900",
    api_key="owner.apikey",
    workspace="admin"
)

If you’re running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:

# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name].hf.space",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

Exploiting displaCy#

spaCy is a well-known open-source library for Natural Language Processing (NLP). It provides a wide range of models for different languages, and it is very easy to use. One of its features is displaCy (https://spacy.io/usage/visualizers), a visualizer for the output of the NLP models. In this tutorial, we will use it to visualize the output of the dependency parser and the NER model.

Using displaCy#

First, we will explain how displaCy works. We load the English spaCy pipeline (en_core_web_sm) while excluding its default NER component, and then use the add_pipe method to append the span_marker component to the end of the pipeline. This new component performs NER with the specified SpanMarker model.

[ ]:

# Load the custom pipeline
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["ner"]
)
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"})

Now, you can see how the displacy.render function takes the processed document and produces the visualization as HTML; with jupyter=False, it returns the HTML string instead of displaying it. Below, two examples are provided: the first illustrates the sentence’s dependency tree, while the second showcases the NER findings.

[3]:

# Show the dependency parse
doc = nlp("Rats are various medium-sized, long-tailed rodents.")
displacy.render(doc, style="dep")

[12]:

# Show the entity recognition
text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."
doc2 = nlp(text)
displacy.render(doc2, style="ent")

When [Sebastian Thrun | person] started working on self-driving cars at [Google | organization] in 2007, few people outside of the company took him seriously.
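
To get a feel for what an entity visualization boils down to, here is a minimal sketch that wraps character spans in `<mark>` tags using only the standard library. The helper name `mark_entities` and its markup are assumptions made for illustration; displaCy's real output is richer (inline styles, label badges) and should be preferred in practice.

```python
from html import escape


def mark_entities(text, spans):
    """Wrap (start, end, label) character spans in <mark> tags.

    Hypothetical helper for illustration; not part of spaCy or Argilla.
    Spans are assumed to be non-overlapping.
    """
    parts = []
    cursor = 0
    for start, end, label in sorted(spans):
        # Plain text between the previous entity and this one
        parts.append(escape(text[cursor:start]))
        # The entity itself, followed by its label in bold
        parts.append(f"<mark>{escape(text[start:end])} <b>{escape(label)}</b></mark>")
        cursor = end
    parts.append(escape(text[cursor:]))
    return "".join(parts)


sentence = "Sebastian Thrun worked at Google."
entity_html = mark_entities(sentence, [(0, 15, "person"), (26, 32, "organization")])
print(entity_html)
```

Because the result is a plain HTML string, it could be stored in any TextField that has use_markdown=True, just like the displaCy output.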

You can also add custom highlights to the text by using create_token_highlights and a custom color map. For example:

from argilla.client.feedback.utils import create_token_highlights

tokens = ["This", "is", "a", "test"]
weights = [0.1, 0.2, 0.3, 0.4]
# custom_RGB stands for a color map that you define yourself;
# omit c_map to use the default 'viridis'
html = create_token_highlights(tokens, weights, c_map=custom_RGB)
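
The idea behind weight-based highlighting can be sketched without any dependencies. The snippet below is a rough stand-in, not Argilla's implementation: the function name `highlight_tokens_sketch`, the fixed base color, and the opacity scaling are all assumptions chosen for illustration.

```python
from html import escape


def highlight_tokens_sketch(tokens, weights, rgb=(255, 200, 0)):
    """Return HTML where each token's background opacity follows its weight.

    Hypothetical stand-in for illustration; it does not reproduce
    Argilla's create_token_highlights or its color maps.
    """
    if len(tokens) != len(weights):
        raise ValueError("tokens and weights must have the same length")
    top = max(weights, default=0)
    if top <= 0:
        top = 1  # avoid dividing by zero when no weight is positive
    red, green, blue = rgb
    spans = []
    for token, weight in zip(tokens, weights):
        # Scale each weight relative to the largest one; negatives become 0
        alpha = round(max(weight, 0) / top, 2)
        spans.append(
            f"<span style='background-color: rgba({red}, {green}, {blue}, {alpha})'>"
            f"{escape(token)}</span>"
        )
    return " ".join(spans)


highlight_html = highlight_tokens_sketch(["This", "is", "a", "test"], [0.1, 0.2, 0.3, 0.4])
print(highlight_html)
```

Scaling by the largest weight keeps the most important token fully opaque regardless of the absolute values, which is one reasonable design choice among several.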

Example: creating a FeedbackDataset with the displaCy output#

In this example, we show how to create an Argilla FeedbackDataset and add the displaCy output to it. This way, annotators can check the accuracy of the model and decide whether the dependencies and/or entities need a correction.

First, we configure the FeedbackDataset (see /practical_guides/create_dataset.html#configure-the-dataset). In fields, we use three TextField instances to show the default text, the dependencies, and the entities. In questions, we add a LabelQuestion, a MultiLabelQuestion, and two TextQuestion instances.

[2]:

# Create the FeedbackDataset configuration
dataset_spacy = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="text", required= True, use_markdown=True),
        rg.TextField(name="dependency-tree", required= True, use_markdown=True),
        rg.TextField(name="entities", required= True, use_markdown=True)
    ],
    questions=[
        rg.LabelQuestion(name="relevant", title="Is the text relevant?", labels=["Yes", "No"], required=True),
        rg.MultiLabelQuestion(name="question-multi", title="Mark which is correct", labels=["flag-pos", "flag-ner"], required=True),
        rg.TextQuestion(name="dependency-correction", title="Write the correct answer if needed", use_markdown=True),
        rg.TextQuestion(name="ner-correction", title="Write the correct answer if needed", use_markdown=True)
    ]
)
dataset_spacy

[2]:

FeedbackDataset(
    fields=[TextField(name='text', title='Text', required=True, type='text', use_markdown=True), TextField(name='dependency-tree', title='Dependency-tree', required=True, type='text', use_markdown=True), TextField(name='entities', title='Entities', required=True, type='text', use_markdown=True)]
    questions=[LabelQuestion(name='relevant', title='Is the text relevant?', description=None, required=True, type='label_selection', labels=['Yes', 'No'], visible_labels=None), MultiLabelQuestion(name='question-multi', title='Mark which is correct', description=None, required=True, type='multi_label_selection', labels=['flag-pos', 'flag-ner'], visible_labels=None), TextQuestion(name='dependency-correction', title='Write the correct answer if needed', description=None, required=True, type='text', use_markdown=True), TextQuestion(name='ner-correction', title='Write the correct answer if needed', description=None, required=True, type='text', use_markdown=True)]
    guidelines=None)
)

Now, we load the Few-NERD dataset from Hugging Face. It contains tokenized sentences annotated with named-entity tags; we take the first 20 training examples to show how to use displaCy within Argilla.

[ ]:

# Read the HF dataset
dataset_fewnerd = load_dataset("DFKI-SLT/few-nerd", "supervised", split="train[:20]")

Next, we will use this dataset to populate our Argilla FeedbackDataset. We call the displacy.render function with jupyter=False so that it returns the displaCy output as an HTML string, and we add that string to the FeedbackDataset together with the original text. Finally, we add Markdown-formatted tables as suggestions to provide basic support for correcting the NER and dependency annotations.

[15]:

# Load the custom pipeline
nlp = spacy.load(
    "en_core_web_sm",
    exclude=["ner"]
)
nlp.add_pipe("span_marker", config={"model": "tomaarsen/span-marker-bert-tiny-fewnerd-coarse-super"})

# Read the dataset and run the pipeline
texts = [" ".join(x["tokens"]) for x in dataset_fewnerd]
docs = nlp.pipe(texts)

[ ]:

# Define the function to set the correct width and height of the SVG element
def wrap_in_max_width(html):
    html = html.replace("max-width: none;", "")

    # Remove the fixed numeric width and height attributes, e.g. width="750"
    html = re.sub(r" width=\"[\d.]+\"", "", html)
    html = re.sub(r" height=\"[\d.]+\"", "", html)

    # Find the SVG element in the HTML output
    svg_start = html.find("<svg")
    svg_end = html.find("</svg>") + len("</svg>")
    svg = html[svg_start:svg_end]

    # Set the width and height attributes of the SVG element to 100%
    svg = svg.replace("<svg", "<svg width='100%' height='100%'")

    # Wrap the SVG element in a div with max-width and horizontal scrolling
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg}</div>"
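
This kind of post-processing can be checked without running a model. The sketch below applies the same idea to a small hand-written SVG string. The name `make_svg_responsive` and the toy markup are assumptions for illustration, and real displaCy output may carry additional attributes.

```python
import re


def make_svg_responsive(html):
    """Strip fixed sizes from the first <svg> and wrap it in a scrollable div.

    Illustrative variant written for this sketch; not part of spaCy or Argilla.
    """
    start = html.find("<svg")
    end = html.find("</svg>")
    if start == -1 or end == -1:
        return html  # nothing to adapt
    svg = html[start : end + len("</svg>")]
    # Drop numeric width/height attributes (integers or decimals)
    svg = re.sub(r'\s(?:width|height)="[\d.]+"', "", svg)
    svg = svg.replace("max-width: none;", "")
    # Let the first <svg> tag fill the width of its container
    svg = svg.replace("<svg", "<svg width='100%'", 1)
    return f"<div style='max-width: 100%; overflow-x: auto;'>{svg}</div>"


toy_svg = (
    '<svg class="displacy" width="1450" height="399.5" '
    'style="max-width: none; color: #000"><text>Rats</text></svg>'
)
responsive = make_svg_responsive(toy_svg)
print(responsive)
```

If you want the dependency tree to scroll horizontally inside the record, one option is to pass the rendered string through such a wrapper before storing it, for example `wrap_in_max_width(displacy.render(doc, style="dep", jupyter=False))`.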

[ ]:

# Add the records to the FeedbackDataset
records = []
for doc in docs:
    record = rg.FeedbackRecord(
        fields={
            "text": doc.text,
            "dependency-tree": displacy.render(doc, style="dep", jupyter=False),
            "entities": displacy.render(doc, style="ent", jupyter=False)
        },
        suggestions=[{
                "question_name": "dependency-correction",
                "value": pd.DataFrame([{"Label": token.dep_, "Text": token.text} for token in doc]).to_markdown(index=False)

            },
            {
                "question_name": "ner-correction",
                "value": pd.DataFrame([{"Label": ent.label_, "Text": ent.text} for ent in doc.ents]).to_markdown(index=False),
            }
        ]
    )
    records.append(record)

dataset_spacy.add_records(records)
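
Note that pandas' to_markdown relies on the optional tabulate package. If you prefer not to depend on it, a Markdown table can also be assembled by hand. The following is a minimal sketch under that assumption; `rows_to_markdown` is a hypothetical helper, and its output layout is simpler than what pandas produces.

```python
def rows_to_markdown(rows, columns=("Label", "Text")):
    """Build a pipe-delimited Markdown table from a list of dicts.

    Hypothetical helper for illustration; cells are not escaped, so values
    containing "|" would need extra handling.
    """
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(row[col]) for col in columns) + " |" for row in rows]
    return "\n".join([header, divider, *body])


table_md = rows_to_markdown(
    [
        {"Label": "person", "Text": "Sebastian Thrun"},
        {"Label": "organization", "Text": "Google"},
    ]
)
print(table_md)
```

The resulting string can be used as the value of a suggestion for a TextQuestion with use_markdown=True, in the same way as the pandas output above.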

[ ]:

# Push the dataset to Argilla
dataset_spacy = dataset_spacy.push_to_argilla(name="exploiting_displacy", workspace="admin")

Exploring Multi-Modality: Video, Audio and Image#

As we already mentioned, Argilla supports handling video, audio, and images within Markdown fields, provided they are formatted in HTML. To facilitate this, we offer three functions: video_to_html, audio_to_html, and image_to_html. These functions accept either the file path or the file’s byte data and return the corresponding HTML to render the media file within the Argilla user interface. Additionally, you can set the width and height in pixels or as a percentage for video and image (defaults to the original dimensions), and set the autoplay and loop attributes to True for audio and video (both default to False).
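
One common way to make media self-contained in HTML is to inline the file as a base64 data URL. The sketch below illustrates that general technique for images only. The helper `image_bytes_to_html_sketch` is hypothetical, and Argilla's own helpers may build their markup differently.

```python
import base64
import mimetypes


def image_bytes_to_html_sketch(data, filename, width=None):
    """Embed raw image bytes in an <img> tag via a base64 data URL.

    Hypothetical helper; Argilla's image_to_html may work differently.
    """
    # Infer the MIME type from the file extension
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image type from {filename!r}")
    encoded = base64.b64encode(data).decode("ascii")
    width_attr = f" width='{width}'" if width else ""
    return f"<img src='data:{mime_type};base64,{encoded}'{width_attr}>"


# Placeholder bytes stand in for the contents of a real image file
image_html = image_bytes_to_html_sketch(b"not-a-real-image", "peacock.jpg", width="50%")
print(image_html[:40])
```

Keep in mind that base64 encoding increases the payload by roughly one third, which matters given the file-size limits mentioned at the start of this tutorial.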

We will define our FeedbackDataset with a TextField to add the media content. We will also add a question to ask the user to describe the video, audio, or image file.

[4]:

# Configure the FeedbackDataset
ds_multi_modal = rg.FeedbackDataset(
    fields=[rg.TextField(name="content", use_markdown=True, required=True)],
    questions=[rg.TextQuestion(name="description", title="Describe the content of the media:", use_markdown=True, required=True)],
)
ds_multi_modal

[4]:

FeedbackDataset(
    fields=[TextField(name='content', title='Content', required=True, type='text', use_markdown=True)]
    questions=[TextQuestion(name='description', title='Describe the content of the media:', description=None, required=True, type='text', use_markdown=True)]
    guidelines=None)
)

We then build the records with the corresponding helper functions and add the media content to the FeedbackDataset with the add_records method.

[24]:

# Add the records
records = [
    rg.FeedbackRecord(fields={"content": video_to_html("/content/snapshot.mp4", autoplay=True)}),
    rg.FeedbackRecord(fields={"content": audio_to_html("/content/sea.wav", autoplay=True, loop=True)}),
    rg.FeedbackRecord(fields={"content": image_to_html("/content/peacock.jpg", width="50%", height="50%")}),
]
ds_multi_modal.add_records(records)

[ ]:

# Push the dataset to Argilla
ds_multi_modal = ds_multi_modal.push_to_argilla("multi-modal-basic", workspace="admin")

Inspecting PDFs#

Argilla also supports PDFs. You can add a PDF to a TextField by using the pdf_to_html function, similarly to what we have done before. This function accepts the file path, a URL, or the file’s byte data and returns the corresponding HTML to render the PDF within the Argilla user interface. Optionally, you can also set the width and height in pixels or as a percentage (defaults to 1000px). Below, you can see an example of how to use this function.

[ ]:

# In this case, we will use the URL of the PDF file
file_url = "https://arxiv.org/pdf/2310.06825.pdf"

# Configure the FeedbackDataset
ds_pdf = rg.FeedbackDataset(
    fields=[
        rg.TextField(name="content", use_markdown=True, required=True),
    ],
    questions=[
        rg.TextQuestion(name="description", use_markdown=True, required=True)
    ],
)

# Push the dataset to Argilla
ds_pdf = ds_pdf.push_to_argilla(name='analyze_pdf_dataset', workspace='argilla')

# Add the records using pdf_to_html
records = [
    rg.FeedbackRecord(fields={"content": pdf_to_html(file_source=file_url, width="700px", height="700px")})]
ds_pdf.add_records(records)

Tip

You can also view .docx, .pptx, or .xlsx files in the Argilla UI by setting use_markdown=True and embedding them, provided they are available at a public URL. For instance, you can upload them to your Google Drive and get the public URL by changing the sharing settings to “Anyone with the link can view”.

file_url = "your-sharable-link.xlsx"
# Quote the attribute values so that URLs with special characters are parsed correctly
html = f'<embed src="{file_url}" type="application/pdf" width="700px" height="700px"></embed>'

ds = rg.FeedbackDataset(...)

records = [
    rg.FeedbackRecord(fields={"xlsx_file": html})]
ds.add_records(records)

ds = ds.push_to_argilla(name='xlsx', workspace='argilla')
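
Sharing links often contain characters such as `&` that should be escaped inside HTML attributes. As a minimal sketch, the hypothetical helper below builds the `<embed>` tag with quoted and escaped values. The function name and the example URL are placeholders, not part of Argilla.

```python
from html import escape


def embed_html_sketch(url, mime_type="application/pdf", width="700px", height="700px"):
    """Build an <embed> tag with quoted, escaped attribute values.

    Hypothetical helper for illustration; not part of Argilla.
    """
    return (
        f'<embed src="{escape(url, quote=True)}" type="{escape(mime_type, quote=True)}" '
        f'width="{escape(width, quote=True)}" height="{escape(height, quote=True)}"></embed>'
    )


# Placeholder URL; replace it with your own public link
embed_html = embed_html_sketch("https://example.com/file.xlsx?usp=sharing&rm=minimal")
print(embed_html)
```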

Now that you’ve learned the Markdown tips, it’s your turn to create your own multi-modal FeedbackDataset! 🔥
