🗺️ Adding bias-equality features to text with disaggregators#

In this tutorial, we show how to use the disaggregators package to identify potential bias in your training data. We walk through the following steps:

  • 📰 Load news summary data

  • 🗺️ Apply disaggregator features

  • 📊 Analyze potential biases

Introduction#

“Addressing fairness and bias in machine learning models is more important than ever! One form of fairness is equal performance across different groups or features. To measure this, evaluation datasets must be disaggregated across the different groups of interest.” - HuggingFace.

In short, the disaggregators package aims to answer the question: “What is in your dataset and how does this influence groups of interest?”

For other bias and explainability measures take a look at our other tutorials on explainability.

Let’s get started!

Setup#

Apart from Argilla, we’ll need a few third-party libraries that can be installed via pip:

[ ]:
%pip install disaggregators -qqq
!python -m spacy download en_core_web_lg -qqq

📰 Load news summary data#

For this analysis, we will use our news summary dataset from the HuggingFace hub. This dataset targets a text2text summarization task, in which news texts are summarized into a single sentence or title. Thanks to the integration with the HuggingFace hub, we can load it in a few lines of code.

[14]:
import argilla as rg
from datasets import load_dataset

# load from datasets
my_dataset = load_dataset("argilla/news-summary")
dataset_rg = rg.read_datasets(my_dataset["train"], task="Text2Text")

# log subset into argilla
rg.log(dataset_rg[:1000], "news-summary", chunk_size=50) # set smaller chunk size to overcome io-issues
1000 records logged to https://pre.argilla.io/datasets/recognai/news-summary
[14]:
BulkResponse(dataset='news-summary', processed=1000, failed=0)

🗺️ Apply disaggregator features#

After uploading the data, we can take a closer look at the disaggregators that the disaggregators package provides. It offers 5 main classes, each with several sub-classes that are assigned to a text based on word matches. As a result, a single text can be assigned to multiple classes.

  • “age”: [“child”, “youth”, “adult”, “senior”]

  • “gender”: [“male”, “female”]

  • “pronoun”: [“she_her”, “he_him”, “they_them”]

  • “religion”: [“judaism”, “islam”, “buddhism”, “christianity”]

  • “continent”: [“africa”, “americas”, “asia”, “europe”, “oceania”]
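
To make the word-match idea concrete, here is a minimal sketch in plain Python. It does not use the disaggregators package and does not reproduce its actual matching logic; the `KEYWORDS` table and the `tag_text` helper are hypothetical and only illustrate how a single text can receive several labels at once.

```python
import re

# Hypothetical keyword lists, chosen for illustration only; the real package
# ships its own matching rules.
KEYWORDS = {
    "age.child": {"child", "children", "kid"},
    "age.senior": {"senior", "elderly", "retiree"},
    "gender.male": {"he", "him", "man", "father"},
    "gender.female": {"she", "her", "woman", "mother"},
}


def tag_text(text: str) -> dict:
    """Return one boolean flag per label, True if any keyword occurs in the text."""
    tokens = set(re.findall(r"[a-z']+", text.lower()))
    return {label: bool(tokens & words) for label, words in KEYWORDS.items()}


# a single sentence can trigger flags from several classes at once
flags = tag_text("The father took his children to see their elderly neighbour.")
print(flags)
```

Simple matching of this kind is cheap but coarse: a flag says that a keyword occurred, not that the text is actually about that group.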

Even though we could apply all categories, we will only work with age and gender to simplify the analysis.

[18]:
from disaggregators import Disaggregator
import pandas as pd
import argilla as rg

# restrict the analysis to the age and gender classes
disaggregator_classes = ["age", "gender"]

# load the logged records and collect their input texts
ds = rg.load("news-summary")
df = pd.DataFrame({"text": [rec.text for rec in ds]})

# compute one boolean column per sub-class for every text
disaggregator = Disaggregator(disaggregator_classes, column="text")
new_cols = df.apply(disaggregator, axis=1)
df = pd.merge(df, pd.json_normalize(new_cols), left_index=True, right_index=True)
df.head(5)
[18]:
   text                                                age.child  age.youth  age.adult  age.senior  gender.male  gender.female
0  MEXICO CITY (Reuters) - Mexico central bank go...   True       True       False      False       True         False
1  WASHINGTON (Reuters) - The Trump administratio...   True       False      False      True        True         False
2  DUBAI (Reuters) - Iran has provided the capabi...   False      False      False      False       False        False
3  PALM BEACH, Fla. (Reuters) - U.S. President-el...   False      False      False      False       True         False
4  WASHINGTON (Reuters) - U.S. Senator Bill Nelso...   False      False      False      False       True         False

Now that we have computed the disaggregator features for each text, we can assign them to the metadata field of each record and update the records with the same ids in the Argilla database.

[26]:
metadata_ds = df[df.columns[1:]].to_dict(orient="records")
for metadata_rec, rec in zip(metadata_ds, ds):
    rec.metadata = metadata_rec
rg.log(ds, "news-summary", chunk_size=50) # upsert records
1000 records logged to https://pre.argilla.io/datasets/recognai/news-summary
[26]:
BulkResponse(dataset='news-summary', processed=1000, failed=0)
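
Before opening the UI, it can help to check how often each flag is set. The sketch below assumes metadata in the same shape as `metadata_ds` above (a list of dicts with boolean values). The `flag_shares` helper is hypothetical, and the three sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

# Made-up sample with the same shape as `metadata_ds` (illustrative values only).
sample_metadata = [
    {"age.child": True, "age.senior": False, "gender.male": True, "gender.female": False},
    {"age.child": False, "age.senior": False, "gender.male": True, "gender.female": False},
    {"age.child": False, "age.senior": False, "gender.male": False, "gender.female": True},
]


def flag_shares(metadata):
    """Return the fraction of records in which each flag is True."""
    counts = Counter()
    for record in metadata:
        counts.update(key for key, value in record.items() if value)
    keys = metadata[0].keys() if metadata else []
    return {key: counts[key] / len(metadata) for key in keys}


shares = flag_shares(sample_metadata)
for key, share in shares.items():
    print(f"{key}: {share:.0%}")
```

A strongly skewed share for one sub-class is a hint that annotation effort, and later the training data, may be skewed in the same way.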

📊 Analyze potential biases#

Within the UI, there are two direct ways to analyze the assigned bias info.

Filter based on metadata info#

By applying filters, we can distribute the number of annotations equally over the potential causes of bias, which helps keep the eventual training data evenly distributed as well. Alternatively, we can decide to only label records for which no disaggregator feature is set, assuming they do not contain any of the considered biases.
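
The balancing idea can also be sketched outside the UI. The `balanced_selection` helper below is hypothetical and not part of Argilla; it assumes each record is a dict with a `metadata` dict of boolean flags and picks at most `per_group` records for each flag of interest.

```python
def balanced_selection(records, flags, per_group):
    """Pick up to `per_group` records for each flag, without duplicates."""
    selected, seen = [], set()
    for flag in flags:
        picked = 0
        for index, record in enumerate(records):
            if picked == per_group:
                break
            # a record already chosen for an earlier flag is not picked again
            if record["metadata"].get(flag) and index not in seen:
                selected.append(record)
                seen.add(index)
                picked += 1
    return selected


# made-up records for illustration
records = [
    {"text": "a", "metadata": {"gender.male": True, "gender.female": False}},
    {"text": "b", "metadata": {"gender.male": True, "gender.female": False}},
    {"text": "c", "metadata": {"gender.male": False, "gender.female": True}},
    {"text": "d", "metadata": {"gender.male": True, "gender.female": True}},
]
subset = balanced_selection(records, ["gender.male", "gender.female"], per_group=1)
print([record["text"] for record in subset])
```

Note that a record matching several flags is counted only for the first flag that selects it, so the groups are balanced by selection count rather than by every flag a record carries.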

Inspect record info#

Inspecting the record info is a bit slower, but it gives context to annotators who suspect bias within a record and allows them to take this into account during annotation.

Alternatives#

Besides the analyses mentioned above, there are likely many more interesting things you can do with this package. A good example is this HuggingFace space. So, be creative and avoid bias while doing so 😉

Summary#

In this tutorial, we learned about the disaggregators package and how to integrate it with Argilla. This can help data scientists, ML engineers and annotators manage and mitigate bias in their datasets.

Next steps#

⭐ Argilla GitHub repo to stay updated.

📚 Argilla documentation for more guides and tutorials.

🙋‍♀️ Join the Argilla community! A good place to start is the discussion forum.