🗺️ Adding bias-equality features to text with disaggregators
In this tutorial, we show how to use the disaggregators package to identify potential bias in your training data. We will walk through the following steps:
📰 Load news summary data
🗺️ Apply disaggregator features
📊 Analyze potential biases
Introduction
"Addressing fairness and bias in machine learning models is more important than ever! One form of fairness is equal performance across different groups or features. To measure this, evaluation datasets must be disaggregated across the different groups of interest." - HuggingFace.
In short, the disaggregators package aims to answer the question: "What is in your dataset and how does this influence groups of interest?"
For other bias and explainability measures, take a look at our other tutorials on explainability.
Let's get started!
Setup
Apart from Argilla, we'll need a few third-party libraries that can be installed via pip:
[ ]:
%pip install disaggregators -qqq
# run the spaCy model download as a shell command; %python is not a line magic
!python -m spacy download en_core_web_lg -qqq
📰 Load news summary data
For this analysis, we will use our news summary dataset from the HuggingFace hub. This dataset focuses on a text2text summarization task, which requires news texts to be summarized into a single sentence or title. Thanks to the integration with the HuggingFace hub, loading it takes only a few lines of code.
[14]:
import argilla as rg
from datasets import load_dataset
# load from datasets
my_dataset = load_dataset("argilla/news-summary")
dataset_rg = rg.read_datasets(my_dataset["train"], task="Text2Text")
# log subset into argilla
rg.log(dataset_rg[:1000], "news-summary", chunk_size=50) # set smaller chunk size to overcome io-issues
1000 records logged to https://pre.argilla.io/datasets/recognai/news-summary
[14]:
BulkResponse(dataset='news-summary', processed=1000, failed=0)
🗺️ Apply disaggregator features
After uploading the data, we can take a closer look at the disaggregators that the disaggregators package provides. It focuses on 5 main classes, each with several sub-classes that are assigned to a text based on word matches. This means that a single text can be assigned to multiple classes.
"age": ["child", "youth", "adult", "senior"]
"gender": ["male", "female"]
"pronoun": ["she_her", "he_him", "they_them"]
"religion": ["judaism", "islam", "buddhism", "christianity"]
"continent": ["africa", "americas", "asia", "europe", "oceania"]
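To make the word-matching idea concrete, here is a minimal sketch of how several sub-classes can fire for the same text. The keyword sets and the `match_labels` helper are hypothetical and only serve as an illustration; the disaggregators package ships its own word lists and matching logic.

```python
import re

# Hypothetical keyword sets, for illustration only (not the package's real word lists).
KEYWORDS = {
    "gender.male": {"he", "him", "man", "men"},
    "gender.female": {"she", "her", "woman", "women"},
    "age.child": {"child", "children", "kid", "kids"},
}


def match_labels(text):
    """Return one boolean flag per sub-class, based on whole-word matches."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {label: bool(tokens & words) for label, words in KEYWORDS.items()}


# One sentence can match several sub-classes at once.
flags = match_labels("The woman and her children met the man.")
print(flags)
```

Because matching happens on whole tokens, a word such as "the" does not trigger the "he" keyword in this sketch.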
Even though we could apply all categories, we will only work with age and gender to simplify the analysis.
[18]:
from disaggregators import Disaggregator
import pandas as pd
import argilla as rg
disaggregator_classes = ["age", "gender"]
ds = rg.load("news-summary")
df = pd.DataFrame({"text": [rec.text for rec in ds]})
disaggregator = Disaggregator(disaggregator_classes, column="text")
new_cols = df.apply(disaggregator, axis=1)
df = pd.merge(df, pd.json_normalize(new_cols), left_index=True, right_index=True)
df.head(5)
[18]:
| | text | age.child | age.youth | age.adult | age.senior | gender.male | gender.female |
|---|---|---|---|---|---|---|---|
| 0 | MEXICO CITY (Reuters) - Mexico central bank go... | True | True | False | False | True | False |
| 1 | WASHINGTON (Reuters) - The Trump administratio... | True | False | False | True | True | False |
| 2 | DUBAI (Reuters) - Iran has provided the capabi... | False | False | False | False | False | False |
| 3 | PALM BEACH, Fla. (Reuters) - U.S. President-el... | False | False | False | False | True | False |
| 4 | WASHINGTON (Reuters) - U.S. Senator Bill Nelso... | False | False | False | False | True | False |
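Before moving on, it can help to summarize how often each sub-class fires. The sketch below rebuilds only the boolean columns of the five preview rows above as a toy frame (the variable names `flags`, `share`, and `no_match` are illustrative); on the real data you would run the same two lines on the flag columns of `df`.

```python
import pandas as pd

# Toy frame mirroring the boolean columns of the five preview rows (texts omitted).
flags = pd.DataFrame(
    {
        "age.child": [True, True, False, False, False],
        "age.youth": [True, False, False, False, False],
        "age.adult": [False, False, False, False, False],
        "age.senior": [False, True, False, False, False],
        "gender.male": [True, True, False, True, True],
        "gender.female": [False, False, False, False, False],
    }
)

# Fraction of records flagged for each sub-class.
share = flags.mean()

# Number of records that match none of the sub-classes.
no_match = int((~flags.any(axis=1)).sum())

print(share)
print(no_match)
```

A large gap between related columns, such as gender.male and gender.female in this small preview, is the kind of imbalance worth checking on the full dataset.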
Now that we have found and appended each of the potential disaggregators, we can assign them to the metadata attribute of each record and update the records with the same ids in the Argilla database.
[26]:
# drop the text column and turn each row into a metadata dict
metadata_ds = df[df.columns[1:]].to_dict(orient="records")
for metadata_rec, rec in zip(metadata_ds, ds):
    rec.metadata = metadata_rec
rg.log(ds, "news-summary", chunk_size=50) # upsert records
1000 records logged to https://pre.argilla.io/datasets/recognai/news-summary
[26]:
BulkResponse(dataset='news-summary', processed=1000, failed=0)
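The pairing of metadata and records above relies on the DataFrame rows and the records being in the same order and of equal length, because `zip` silently stops at the shorter input. The following sketch shows the shape of the metadata dicts and a cheap length check; `ToyRecord` is a hypothetical stand-in for an Argilla record, used only to keep the example self-contained.

```python
from dataclasses import dataclass, field

import pandas as pd


@dataclass
class ToyRecord:
    """Hypothetical stand-in for an Argilla record (text plus metadata)."""

    text: str
    metadata: dict = field(default_factory=dict)


records = [ToyRecord("first text"), ToyRecord("second text")]
df = pd.DataFrame(
    {
        "text": [rec.text for rec in records],
        "age.child": [True, False],
        "gender.male": [False, True],
    }
)

# One dict per row, keyed by the disaggregator column names.
metadata_ds = df.drop(columns="text").to_dict(orient="records")

# zip would silently truncate on a length mismatch, so check it explicitly.
assert len(metadata_ds) == len(records)

for metadata_rec, rec in zip(metadata_ds, records):
    rec.metadata = metadata_rec

print(records[0].metadata)
```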
📊 Analyze potential biases
Within the UI, there are two direct ways in which we can analyze the assigned bias-info.
Filter based on metadata info
By applying filters, we can distribute the number of annotations equally over the potential causes of bias, which helps keep the eventual training data evenly distributed as well. Alternatively, we can decide to label only data that has zero disaggregation, assuming it does not contain any of the considered biases.
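The same two selection strategies can also be sketched outside the UI with pandas. The example below uses a made-up toy frame and illustrative names (`flag_cols`, `neutral`, `samples`); it is a sketch of the idea under those assumptions, not a feature of Argilla or the disaggregators package.

```python
import pandas as pd

# Toy stand-in for the disaggregated frame; the values are made up.
df = pd.DataFrame(
    {
        "text": [f"doc {i}" for i in range(8)],
        "gender.male": [True, True, True, True, True, False, False, False],
        "gender.female": [False, False, False, False, True, True, True, False],
    }
)
flag_cols = ["gender.male", "gender.female"]

# Strategy 1: records with zero disaggregation (no flag set at all).
neutral = df[~df[flag_cols].any(axis=1)]

# Strategy 2: draw the same number of records from every group,
# limited by the size of the smallest group.
n = int(df[flag_cols].sum().min())
samples = {col: df[df[col]].sample(n=n, random_state=0) for col in flag_cols}

print(neutral["text"].tolist())
print({col: len(sample) for col, sample in samples.items()})
```

Since a text can match several groups, the same record may show up in more than one sample, so deduplicate by index before merging the samples into one annotation queue.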
Inspect record info
Inspecting the record info is a bit slower, but it can give context to annotators who suspect bias within the data, allowing them to take it into account during annotation.
Alternatives
Besides the analyses mentioned above, there are likely many more interesting things you can do with this package. A good example is this HuggingFace space. So, be creative and avoid bias while doing so.
Summary
In this tutorial, we learned about the disaggregators package and how to integrate it with Argilla. This can help data scientists, ML engineers, and annotators manage and mitigate bias in their datasets.
Next steps
⭐ Argilla GitHub repo to stay updated.
📚 Argilla documentation for more guides and tutorials.
🙋‍♀️ Join the Argilla community! A good place to start is the discussion forum.