Annotation metrics#

Here we describe the available metrics in Argilla:

  • Agreement Metrics: Metrics of agreement on an annotation task

  • Annotator Metrics: Metrics for annotators. Includes both metrics per annotator and unified metrics for all annotators.

Base Metric#

class argilla.client.feedback.metrics.base.AgreementMetricResult(*, metric_name, count, result)#

Container for the result of an agreement metric.

It contains two fields, metric_name and result with the value of the metric.

Parameters:
  • metric_name (str) –

  • count (int) –

  • result (float) –

class argilla.client.feedback.metrics.base.ModelMetricResult(*, metric_name, count, result)#

Container for the result of an annotator metric.

It contains two fields, metric_name and result with the value of the metric.

Parameters:
  • metric_name (str) –

  • count (int) –

  • result (Union[float, Dict[str, float], DataFrame, Dict[str, DataFrame]]) –

class argilla.client.feedback.metrics.base.MetricBase(dataset, question_name, responses_vs_suggestions=True)#
Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • responses_vs_suggestions (bool) –

__init__(dataset, question_name, responses_vs_suggestions=True)#

Initializes an AgreementMetric object to compute agreement metrics on a FeedbackDataset for a given question.

Parameters:
  • dataset (FeedbackDataset) – FeedbackDataset to compute the metrics.

  • question_name (str) – Name of the question for which we want to analyse the agreement.

  • responses_vs_suggestions (bool) – Whether to compare the responses vs the suggestions, or the other way around. Defaults to True (the metrics will be compared assuming the responses are the ground truth and the suggestions are the predictions).

Raises:

NotImplementedError – If the question type is not supported.

Return type:

None

property allowed_metrics: List[str]#

Available metrics for the given question.
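
For instance, a minimal sketch to inspect which metrics can be requested through one of the subclasses, assuming a FeedbackDataset named dataset and a supported question name stored in question (as in the examples below):

>>> from argilla.client.feedback.metrics import AgreementMetric
>>> metric = AgreementMetric(dataset=dataset, question_name=question)
>>> print(metric.allowed_metrics)  # the exact list depends on the question type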

Agreement Metrics#

This module contains metrics to gather information related to inter-annotator agreement.

class argilla.client.feedback.metrics.agreement_metrics.KrippendorfAlpha(annotated_dataset=None, distance_function=None)#

Krippendorff's alpha agreement metric.

It is a statistical measure of the inter-annotator agreement achieved when coding a set of units of analysis.

To interpret the results from this metric, we refer the reader to the Wikipedia entry. The common consensus dictates that a value of alpha >= 0.8 indicates a reliable annotation, a value >= 0.667 can only guarantee tentative conclusions, while lower values suggest an unreliable annotation.

Parameters:
  • annotated_dataset (FormattedResponses) –

  • distance_function (Callable) –

class argilla.client.feedback.metrics.agreement_metrics.NLTKAnnotationTaskMetric(annotated_dataset=None, distance_function=None)#

Base class for metrics that use nltk's AnnotationTask class.

These metrics make use of a distance function to compute the distance between two annotations.

It is often the case that we don't want to treat two different labels as complete disagreement, and so the AnnotationTask constructor can also take a distance metric as a final argument. Distance metrics are functions that take two arguments, and return a value between 0.0 and 1.0 indicating the distance between them.

By default, the following distance metrics are provided for each type of question:

For LabelQuestion, binary_distance:

>>> am.binary_distance("a", "b")
1.0
>>> am.binary_distance("a", "a")
0.0

For MultiLabelQuestion, masi_distance:

>>> label_sets = [
...     [frozenset(["a", "b"]), frozenset(["b", "a"])],
...     [frozenset(["a"]), frozenset(["a", "b"])],
...     [frozenset(["c"]), frozenset(["a", "b"])],
... ]
>>> for a, b in label_sets:
...     print((a,b), am.masi_distance(a,b))
...
(frozenset({'a', 'b'}), frozenset({'a', 'b'})) 0.0
(frozenset({'a'}), frozenset({'a', 'b'})) 0.665
(frozenset({'c'}), frozenset({'a', 'b'})) 1.0

For RatingQuestion, interval_distance:

>>> for a, b in [(1, 1), (1, 2), (3,6)]:
...     print((a,b), am.interval_distance(a,b))
...
(1, 1) 0
(1, 2) 1
(3, 6) 9

For RankingQuestion, kendall_tau_dist:

>>> for i, a in enumerate(itertools.permutations(values, len(values))):
...     for j, b in enumerate(itertools.permutations(values, len(values))):
...         if j >= i:
...             print((a, b), kendall_tau_dist(a,b))
...
((1, 2, 3), (1, 2, 3)) 0.0
((1, 2, 3), (1, 3, 2)) 0.3333333333333333
((1, 2, 3), (2, 1, 3)) 0.3333333333333333
((1, 2, 3), (2, 3, 1)) 0.6666666666666667
((1, 2, 3), (3, 1, 2)) 0.6666666666666667
((1, 2, 3), (3, 2, 1)) 1.0
((1, 3, 2), (1, 3, 2)) 0.0
...
Parameters:
  • annotated_dataset (FormattedResponses) –

  • distance_function (Callable) –
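
These metrics delegate to nltk's AnnotationTask. As a rough sketch of what happens under the hood, using nltk directly rather than the Argilla API and made-up coder/item/label triples:

>>> from nltk.metrics.agreement import AnnotationTask
>>> from nltk.metrics.distance import binary_distance
>>> data = [  # (coder, item, label) triples, made up for illustration
...     ("annotator-1", "record-1", "positive"),
...     ("annotator-2", "record-1", "positive"),
...     ("annotator-1", "record-2", "negative"),
...     ("annotator-2", "record-2", "positive"),
... ]
>>> task = AnnotationTask(data=data, distance=binary_distance)
>>> task.alpha()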

class argilla.client.feedback.metrics.agreement_metrics.AgreementMetric(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#

Main class to compute agreement metrics.

Example

>>> import argilla as rg
>>> from argilla.client.feedback.metrics import AgreementMetric
>>> metric = AgreementMetric(dataset=dataset, question_name=question, filter_by={"response_status": "submitted"})
>>> metrics_report = metric.compute("alpha")
Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) –

  • sort_by (Optional[List[SortBy]]) –

  • max_records (Optional[int]) –

__init__(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#

Initialize an AgreementMetric object to compute agreement metrics.

Parameters:
  • dataset (FeedbackDataset) – FeedbackDataset to compute the metrics.

  • question_name (str) – Name of the question for which we want to analyse the agreement.

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) – A dict whose keys are the fields to filter by and whose values are the filters to apply. Can be one of: draft, pending, submitted, and discarded. Defaults to None (no filter is applied).

  • sort_by (Optional[List[SortBy]]) – A list of SortBy objects to sort your dataset by. Defaults to None (no sorting is applied).

  • max_records (Optional[int]) – The maximum number of records to use to compute the metrics. Defaults to None.

Return type:

None

compute(metric_names)#

Computes the agreement metrics for the given question.

Parameters:
  • metric_names (Union[str, List[str]]) – name or list of names for the metrics to compute, i.e. alpha.

  • kwargs – additional arguments to pass to the metric.

Raises:

ValueError – If the metric name is not supported for the given question.

Returns:

A list of AgreementMetricResult objects for the dataset.

Return type:

agreement_metrics
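
Continuing the example above, a minimal sketch for reading the returned AgreementMetricResult objects (assuming metrics_report comes from the AgreementMetric example):

>>> results = metrics_report if isinstance(metrics_report, list) else [metrics_report]
>>> for result in results:
...     print(result.metric_name, result.count, result.result)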

Annotator Metrics#

This module contains metrics for Suggestions Metric and Responses Metric.

class argilla.client.feedback.metrics.annotator_metrics.AccuracyMetric(responses=None, suggestions=None)#

Accuracy score: the proportion of the responses that are equal to the suggestions offered.

We use the implementation in: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html#sklearn.metrics.accuracy_score

In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
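
For reference, a minimal sketch of the underlying scikit-learn call with hypothetical labels (with the default responses_vs_suggestions=True, the responses play the role of y_true and the suggestions of y_pred):

>>> from sklearn.metrics import accuracy_score
>>> responses = ["positive", "negative", "negative", "neutral"]    # hypothetical annotator responses
>>> suggestions = ["positive", "negative", "positive", "neutral"]  # hypothetical suggestions
>>> accuracy_score(y_true=responses, y_pred=suggestions)  # 3 of 4 match -> 0.75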

class argilla.client.feedback.metrics.annotator_metrics.AnnotatorMetric(dataset, question_name, filter_by=None, sort_by=None, max_records=None, responses_vs_suggestions=True)#

Main class to compute annotator metrics. Annotator metrics refer to the combination of the Suggestions Metric and the Responses Metric. They are both different from the Agreement Metric (i.e. inter-annotator agreement) and are used to compute metrics contrasting suggestions vs responses.

Example

>>> import argilla as rg
>>> from argilla.client.feedback.metrics import AnnotatorMetric
>>> metric = AnnotatorMetric(dataset=dataset, question_name=question)
>>> metrics_report = metric.compute("accuracy")
Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) –

  • sort_by (Optional[List[SortBy]]) –

  • max_records (Optional[int]) –

  • responses_vs_suggestions (bool) –

compute(metric_names, show_progress=True)#

Computes the annotator metrics for the given question.

Parameters:
  • metric_names (Union[str, List[str]]) – name or list of names for the metrics to compute, i.e. accuracy.

  • show_progress (bool) –

Raises:

ValueError – If the metric name is not supported for the given question.

Returns:

A dict with the metrics computed for each annotator, where the key corresponds to the user id and the values are a list with the metric results.

Return type:

metrics
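
For example, a sketch that iterates over the per-annotator report, assuming metric comes from the AnnotatorMetric example above:

>>> metrics_report = metric.compute("accuracy")
>>> for user_id, results in metrics_report.items():
...     for result in results:
...         print(user_id, result.metric_name, result.result)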

class argilla.client.feedback.metrics.annotator_metrics.ConfusionMatrixMetric(responses=None, suggestions=None)#

Compute confusion matrix to evaluate the accuracy of an annotator.

In case of multiclass classification, this function returns a confusion matrix class-wise.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
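
A minimal sketch of the equivalent scikit-learn call with hypothetical labels:

>>> from sklearn.metrics import confusion_matrix
>>> responses = ["positive", "negative", "positive"]    # hypothetical annotator responses
>>> suggestions = ["positive", "positive", "negative"]  # hypothetical suggestions
>>> confusion_matrix(y_true=responses, y_pred=suggestions, labels=["negative", "positive"])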

class argilla.client.feedback.metrics.annotator_metrics.F1ScoreMetric(responses=None, suggestions=None)#

F1 score: 2 * (precision * recall) / (precision + recall)

We use the implementation in: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score

In case of multiclass data, calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
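
A minimal sketch of the macro-averaged scikit-learn call described above, with hypothetical labels:

>>> from sklearn.metrics import f1_score
>>> responses = ["positive", "negative", "positive", "negative"]    # hypothetical annotator responses
>>> suggestions = ["positive", "negative", "negative", "negative"]  # hypothetical suggestions
>>> f1_score(y_true=responses, y_pred=suggestions, average="macro")  # unweighted mean over labels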

class argilla.client.feedback.metrics.annotator_metrics.GLEUMetric(responses=None, suggestions=None)#

Improvement of BLEU that takes into account the length of the response.

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. The Google-BLEU is an improvement of BLEU that addresses some undesirable properties found on single sentences.

https://huggingface.co/spaces/evaluate-metric/bleu https://huggingface.co/spaces/evaluate-metric/google_bleu

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
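
A minimal sketch using the evaluate library's google_bleu metric with hypothetical texts:

>>> import evaluate
>>> gleu = evaluate.load("google_bleu")
>>> gleu.compute(
...     predictions=["the cat sat on the mat"],          # hypothetical response text
...     references=[["the cat is sitting on the mat"]],  # hypothetical suggestion text
... )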

class argilla.client.feedback.metrics.annotator_metrics.ModelMetric(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#

Metric where the suggestions are the ground truths and the responses are compared against them.

Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) –

  • sort_by (Optional[List[SortBy]]) –

  • max_records (Optional[int]) –

class argilla.client.feedback.metrics.annotator_metrics.MultiLabelAccuracyMetric(responses=None, suggestions=None)#

Computes the accuracy on the binarized data for multilabel classification.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.MultiLabelConfusionMatrixMetric(responses=None, suggestions=None)#

Compute confusion matrix to evaluate the accuracy of an annotator.

The data is binarized, so we will return a dict with the confusion matrix for each class.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.MultiLabelF1ScoreMetric(responses=None, suggestions=None)#

Computes the f1-score on the binarized data for multilabel classification.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.MultiLabelMetrics(responses=None, suggestions=None)#

Parent class for MultiLabel based metrics. It binarizes the data to compute the metrics.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
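
As an illustration of the binarization step (a sketch of the idea, not necessarily the exact internal implementation), using scikit-learn's MultiLabelBinarizer with hypothetical label sets:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> responses = [["sports"], ["sports", "politics"], []]    # hypothetical multi-label responses
>>> suggestions = [["sports"], ["politics"], ["politics"]]  # hypothetical multi-label suggestions
>>> binarizer = MultiLabelBinarizer().fit(responses + suggestions)
>>> binarizer.transform(responses)    # one 0/1 column per label
>>> binarizer.transform(suggestions)  # metrics are then computed on these binary matrices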

class argilla.client.feedback.metrics.annotator_metrics.MultiLabelPrecisionMetric(responses=None, suggestions=None)#

Computes the precision on the binarized data for multilabel classification.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.MultiLabelRecallMetric(responses=None, suggestions=None)#

Computes the recall on the binarized data for multilabel classification.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.NDCGMetric(responses=None, suggestions=None)#

Compute Normalized Discounted Cumulative Gain.

From the Wikipedia page for Discounted Cumulative Gain:

"Discounted cumulative gain (DCG) is a measure of ranking quality. In information retrieval, it is often used to measure effectiveness of web search engine algorithms or related applications. Using a graded relevance scale of documents in a search-engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated from the top of the result list to the bottom, with the gain of each result discounted at lower ranks"

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
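
One way to compute NDCG, shown here for illustration with scikit-learn's ndcg_score and hypothetical relevance values (not necessarily the exact implementation used by this metric):

>>> from sklearn.metrics import ndcg_score
>>> true_relevance = [[3, 2, 1]]    # hypothetical relevance from the responses
>>> predicted_scores = [[2, 3, 1]]  # hypothetical relevance from the suggestions
>>> ndcg_score(true_relevance, predicted_scores)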

class argilla.client.feedback.metrics.annotator_metrics.PearsonCorrelationCoefficientMetric(responses=None, suggestions=None)#
Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.PrecisionMetric(responses=None, suggestions=None)#

Compute the precision: tp / (tp + fp)

We use the implementation in: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_score.html#sklearn.metrics.precision_score

In case of multiclass data, calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
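
A minimal sketch of the macro-averaged scikit-learn call with hypothetical labels:

>>> from sklearn.metrics import precision_score
>>> responses = ["positive", "negative", "positive", "negative"]    # hypothetical annotator responses
>>> suggestions = ["positive", "negative", "negative", "negative"]  # hypothetical suggestions
>>> precision_score(y_true=responses, y_pred=suggestions, average="macro")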

class argilla.client.feedback.metrics.annotator_metrics.ROUGEMetric(responses=None, suggestions=None)#

From the evaluate library:

ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation. Note that ROUGE is case insensitive, meaning that upper case letters are treated the same way as lower case letters.

https://huggingface.co/spaces/evaluate-metric/rouge

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
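
A minimal sketch using the evaluate library's rouge metric with hypothetical texts:

>>> import evaluate
>>> rouge = evaluate.load("rouge")
>>> rouge.compute(
...     predictions=["the quick brown fox jumped"],      # hypothetical response text
...     references=["the quick brown fox jumped high"],  # hypothetical suggestion text
... )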

class argilla.client.feedback.metrics.annotator_metrics.RecallMetric(responses=None, suggestions=None)#

Compute the recall: tp / (tp + fn)

We use the implementation in: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.recall_score.html#sklearn.metrics.recall_score

In case of multiclass data, calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.

Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –
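
A minimal sketch of the macro-averaged scikit-learn call with hypothetical labels:

>>> from sklearn.metrics import recall_score
>>> responses = ["positive", "negative", "positive", "negative"]    # hypothetical annotator responses
>>> suggestions = ["positive", "negative", "negative", "negative"]  # hypothetical suggestions
>>> recall_score(y_true=responses, y_pred=suggestions, average="macro")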

class argilla.client.feedback.metrics.annotator_metrics.SpearmanCorrelationCoefficientMetric(responses=None, suggestions=None)#
Parameters:
  • responses (List[Union[float, int, str]]) –

  • suggestions (List[Union[float, int, str]]) –

class argilla.client.feedback.metrics.annotator_metrics.UnifiedAnnotatorMetric(dataset, question_name, strategy_name='majority', filter_by=None, sort_by=None, max_records=None, responses_vs_suggestions=True)#

Main class to compute metrics for a unified dataset.

Example

>>> import argilla as rg
>>> from argilla.client.feedback.metrics import UnifiedAnnotatorMetric
>>> metric = UnifiedAnnotatorMetric(dataset=dataset, question_name=question)
>>> metrics_report = metric.compute("accuracy")
Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • strategy_name (str) –

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) –

  • sort_by (Optional[List[SortBy]]) –

  • max_records (Optional[int]) –

  • responses_vs_suggestions (bool) –

compute(metric_names)#

Computes the unified annotation metrics for the given question.

Parameters:
  • metric_names (Union[str, List[str]]) – name or list of names for the metrics to compute, i.e. accuracy.

  • kwargs – additional arguments to pass to the metric.

Raises:

ValueError – If the metric name is not supported for the given question.

Returns:

List of annotator metrics results if more than one metric is computed, or the result container if only one metric is computed.

Return type:

metrics

class argilla.client.feedback.metrics.annotator_metrics.ModelMetric(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#

Metric where the suggestions are the ground truths and the responses are compared against them.

Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) –

  • sort_by (Optional[List[SortBy]]) –

  • max_records (Optional[int]) –

__init__(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#

Initialize an AnnotatorMetric object to compute annotator metrics for both the Suggestions Metric and the Responses Metric.

Parameters:
  • dataset (FeedbackDataset) – FeedbackDataset to compute the metrics.

  • question_name (str) – Name of the question for which we want to analyse the agreement.

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) – A dict whose keys are the fields to filter by and whose values are the filters to apply. Can be one of: draft, pending, submitted, and discarded. Defaults to None (no filter is applied).

  • sort_by (Optional[List[SortBy]]) – A list of SortBy objects to sort your dataset by. Defaults to None (no sorting is applied).

  • max_records (Optional[int]) – The maximum number of records to use to compute the metrics. Defaults to None.

  • responses_vs_suggestions – Whether to use the Suggestions Metric (where the suggestions are the ground truths and the responses are compared against them) or the Responses Metric (where the responses are the ground truths and the suggestions are compared against them). Defaults to True, i.e. the Responses Metric.

Return type:

None

compute(metric_names, show_progress=True)#

Computes the annotator metrics for the given question.

Parameters:
  • metric_names (Union[str, List[str]]) – name or list of names for the metrics to compute, i.e. accuracy.

  • show_progress (bool) –

Raises:

ValueError – If the metric name is not supported for the given question.

Returns:

A dict with the metrics computed for each annotator, where the key corresponds to the user id and the values are a list with the metric results.

Return type:

metrics

class argilla.client.feedback.metrics.annotator_metrics.UnifiedModelMetric(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#
Parameters:
  • dataset (FeedbackDataset) –

  • question_name (str) –

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) –

  • sort_by (Optional[List[SortBy]]) –

  • max_records (Optional[int]) –

__init__(dataset, question_name, filter_by=None, sort_by=None, max_records=None)#

Initialize an AnnotatorMetric object to compute annotator metrics for both the Suggestions Metric and the Responses Metric.

Parameters:
  • dataset (FeedbackDataset) – FeedbackDataset to compute the metrics.

  • question_name (str) – Name of the question for which we want to analyse the agreement.

  • filter_by (Optional[Dict[str, Union[ResponseStatusFilter, List[ResponseStatusFilter]]]]) – A dict whose keys are the fields to filter by and whose values are the filters to apply. Can be one of: draft, pending, submitted, and discarded. Defaults to None (no filter is applied).

  • sort_by (Optional[List[SortBy]]) – A list of SortBy objects to sort your dataset by. Defaults to None (no sorting is applied).

  • max_records (Optional[int]) – The maximum number of records to use to compute the metrics. Defaults to None.

  • responses_vs_suggestions – Whether to use the Suggestions Metric (where the suggestions are the ground truths and the responses are compared against them) or the Responses Metric (where the responses are the ground truths and the suggestions are compared against them). Defaults to True, i.e. the Responses Metric.

Return type:

None

compute(metric_names)#

Computes the unified annotation metrics for the given question.

Parameters:
  • metric_names (Union[str, List[str]]) – name or list of names for the metrics to compute, i.e. accuracy.

  • kwargs – additional arguments to pass to the metric.

Raises:

ValueError – If the metric name is not supported for the given question.

Returns:

List of annotator metrics results if more than one metric is computed, or the result container if only one metric is computed.

Return type:

metrics