๐Ÿ‘จ๐Ÿฝ๐Ÿ’ฌ Text Generation#

The expression TextGeneration encompasses text generation tasks where the model receives a sequence of tokens as input and outputs another sequence of tokens. Examples of such tasks are machine translation, text summarization, and paraphrase generation.
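
All of these tasks use the same record type. As a minimal sketch (the field values are purely illustrative, and we assume the annotation field takes the corrected output as a string):

[ ]:
import argilla as rg

# A TextGeneration record holds the input text, optional model predictions,
# and an optional human-provided output (the values here are illustrative)
record = rg.TextGenerationRecord(
    text="Hello world!",
    prediction=["¡Hola mundo!"],  # one or more candidate outputs
    annotation="¡Hola, mundo!",  # the corrected/ground-truth output
)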

Machine translation#

Machine translation is the task of translating text from one language to another. It is arguably one of the oldest NLP tasks, but human parity remains an open challenge, especially for low-resource languages and domains.

In the following small example, we will showcase how Argilla can help you fine-tune an English-to-Spanish translation model. Let us assume we want to translate "Sesame Street" related content. If you have been to Spain before, you have probably noticed that named entities (like character or band names) are often translated quite literally or are very different from the original ones.
We will use a pre-trained 🤗 transformers model to get a few suggestions for the translation, and then correct them in Argilla to obtain a training set for the fine-tuning.
[ ]:
#!pip install transformers

from transformers import pipeline
import argilla as rg

# Instantiate the translator
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")

# 'Sesame Street' related phrase
en_phrase = (
    "Sesame Street is an American educational children's television series starring the"
    " muppets Ernie and Bert."
)

# Get two predictions from the translator
es_predictions = [
    output["translation_text"]
    for output in translator(en_phrase, num_return_sequences=2)
]

# Log the record to Argilla, where we will correct the predictions
record = rg.TextGenerationRecord(
    text=en_phrase,
    prediction=es_predictions,
)
rg.log(record, name="sesame_street_en-es")

# For a real training set, you would probably need more than just one 'Sesame Street' related phrase.

In the Argilla web app we can now easily browse the predictions and annotate the records with a corrected prediction of our choice. The predictions for our example phrase are:

  1. Sesame Street es una serie de televisión infantil estadounidense protagonizada por los muppets Ernie y Bert.

  2. Sesame Street es una serie de televisión infantil y educativa estadounidense protagonizada por los muppets Ernie y Bert.

We would probably choose the second one and correct it in the following way:

  1. Barrio Sésamo es una serie de televisión infantil y educativa estadounidense protagonizada por los teleñecos Epi y Blas.
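
The correction is normally done in the Argilla web app, but you could also log a fully annotated record from code. A minimal sketch, assuming the annotation field of TextGenerationRecord accepts the corrected string:

[ ]:
# Sketch: attach the corrected translation as an annotation
corrected_translation = (
    "Barrio Sésamo es una serie de televisión infantil y educativa estadounidense"
    " protagonizada por los teleñecos Epi y Blas."
)
record = rg.TextGenerationRecord(
    text=en_phrase,
    prediction=es_predictions,
    annotation=corrected_translation,
)
rg.log(record, name="sesame_street_en-es")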

After correcting a substantial number of example phrases, we can load the corrected dataset as a DataFrame and use it to fine-tune the model.

[ ]:
# Load the corrected translations as a pandas DataFrame for fine-tuning the translation model
df = rg.load("sesame_street_en-es").to_pandas()
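
From the DataFrame you could then, for example, keep only the records that actually received a correction and tokenize them as source/target pairs for a 🤗 seq2seq fine-tuning run. A rough sketch; the column names mirror the record fields above, and the text_target tokenizer argument requires a recent transformers version:

[ ]:
from transformers import AutoTokenizer

# Keep only the records that received a corrected translation
annotated = df[df.annotation.notna()]

# Tokenize (source, target) pairs for seq2seq fine-tuning
tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
model_inputs = tokenizer(
    list(annotated.text),
    text_target=list(annotated.annotation),
    truncation=True,
)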

Text summarization#

Summarization is the task of producing a shorter version of a document while preserving its important information. Some models can extract text from the original input, while other models can generate entirely new text.

The following example explains how to use a 🤗 transformers summarization pipeline together with Argilla to log summarization predictions.

[ ]:
from transformers import pipeline
import argilla as rg

# test text
phrase = (
    "Paris is the capital and most populous city of France, with an estimated"
    " population of 2,175,601 residents as of 2018, in an area of more than 105 square"
    " kilometres (41 square miles). The City of Paris is the centre and seat of"
    " government of the region and province of รƒฦ’ร‚ลฝle-de-France, or Paris Region, which"
    " has an estimated population of 12,174,880, or about 18 percent of the population"
    " of France as of 2017."
)
summarizer = pipeline("summarization")

# Get three candidate summaries
predictions = [
    output["summary_text"]
    for output in summarizer(phrase, num_return_sequences=3, max_length=56)
]

# Log the record to Argilla
record = rg.TextGenerationRecord(
    text=phrase,
    prediction=predictions,
)
rg.log(
    record,
    name="paris-demographic-summary",
    tags={"task": "summarization", "family": "textgeneration"},
)
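
As with the translation example, we can load the logged records back into a DataFrame to review the summaries and export the annotations. A small sketch, assuming the same load-and-convert pattern as above:

[ ]:
# Load the logged summaries back for inspection or annotation export
df = rg.load("paris-demographic-summary").to_pandas()
df[["text", "prediction"]].head()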