Open In Colab  View Notebook on GitHub

Filter and Query Records#

This tutorial is part of a series in which we will get to know the FeedbackDataset. In this step, we will show how to filter and query our records, a process that helps us analyze data efficiently, extract relevant information, manage data size, and focus on the insights needed for informed decision-making. You can have a look at the previous tutorials to add metadata, vectors, and suggestions and responses. Feel free to check out the practical guides page for more in-depth information.

Table of contents#

  1. Pull the Dataset

    1. From Argilla

    2. From HuggingFace Hub

  2. Filter

    1. By Fields Content

    2. By Metadata Property

    3. By Suggestions and Responses

  3. Sort

  4. Semantic Search

  5. Conclusion

Running Argilla#

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks:

deploy on spaces

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla's quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.

Tip

This tutorial is a Jupyter Notebook. There are two options to run it:

  • Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don't forget to change the runtime type to GPU for faster model training and inference.

  • Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.

First, let's install our dependencies and import the necessary libraries:

[ ]:
!pip install argilla
!pip install datasets
!pip install sentence-transformers
[ ]:
import argilla as rg
from argilla._constants import DEFAULT_API_KEY
from sentence_transformers import SentenceTransformer

In order to run this notebook, we will need some credentials to push and load datasets from Argilla and 🤗 Hub. Let's set them in the following cell:

[ ]:
# Argilla credentials
api_url = "http://localhost:6900"  # "https://<YOUR-HF-SPACE>.hf.space"
api_key = DEFAULT_API_KEY  # admin.apikey
# Huggingface credentials
hf_token = "hf_..."

Log in to Argilla:

[ ]:
rg.init(api_url=api_url, api_key=api_key)

Enable Telemetry#

We gain valuable insights from how you interact with our tutorials. Running the following lines of code lets us know that this tutorial is serving you effectively, which helps us offer you the most suitable content. The data is entirely anonymous, and you can skip this step if you prefer. For more info, please check out the Telemetry page.

[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

Pull the Dataset#

We uploaded the dataset created in the previous tutorial to both Argilla and HuggingFace Hub, so we could pull it from either of them. However, filtering and querying are only supported on a RemoteFeedbackDataset, so we will only pull the dataset from Argilla.

From Argilla#

We can pull the dataset from Argilla by using the from_argilla method.

[10]:
dataset_remote_with_metadata = rg.FeedbackDataset.from_argilla("end2end_textclassification_with_metadata")
dataset_remote_with_vectors = rg.FeedbackDataset.from_argilla("end2end_textclassification_with_vectors")
dataset_remote_with_suggestions_and_responses = rg.FeedbackDataset.from_argilla("end2end_textclassification_with_suggestions_and_responses")

From HuggingFace Hub#

Not all sorting and filtering is supported with a local FeedbackDataset, pulled from HuggingFace Hub. Hence, we will only pull the dataset from Argilla.

Note

The dataset pulled from HuggingFace Hub is an instance of FeedbackDataset whereas the dataset pulled from Argilla is an instance of RemoteFeedbackDataset. The difference between the two is that the former is a local one and the changes made on it stay locally. On the other hand, the latter is a remote one and the changes made on it are directly reflected on the dataset on the Argilla server, which can make your process faster.

Filter#

Filtering allows us to select a subset of the records in our dataset based on a condition. We can filter our dataset with the filter_by method, which returns a FeedbackDataset with a subset of the records. We can filter by field content, by metadata property and by status.

Note: The records won't be modified unless updates or deletions are specifically applied at record-level.

By Fields Content#

We can filter our dataset by the content of the fields. To do so, we only need to type the content we want to filter by in the search bar in the top left corner, above the record card. In the image, you can see that we searched for records containing the word 'blue' (which appears highlighted), and two matching records were found.

filter_by_fields.PNG

By Metadata Property#

The UI allows us to filter using the metadata properties and to combine the filters we need. Below, you can see an example where we filtered by our Annotation Group, Length of the text and Standard Deviation metadata properties, so that from the 1000 records we only got 242. Note that if a property was set to visible_for_annotators=False, it would only appear for users with the admin or owner role.

This can also be very useful to assign records to your team members in a common workspace. Please refer to the metadata tutorial and how to assign records for more information.

filter-by-metadata.PNG

Now, we will apply the same filtering using the filter_by method provided in the Python SDK, so we will need to combine the three filters. In addition, each metadata property has a different type, so we will need to use TermsMetadataFilter for the annotation group, IntegerMetadataFilter for the length of the text, and FloatMetadataFilter for the standard deviation. We will be using the following parameters:

Parameter  Description                    TermsMetadataFilter  IntegerMetadataFilter  FloatMetadataFilter
name       name of the metadata property  group                length                 length_std
ge         values greater than or equal   not required         0                      204
le         values lower than or equal     not required         282                    290
values     values searched                group-1 and group-2  not required           not required

In the case of Integer and Float filters, at least one of ge or le must be provided.

[16]:
filtered_records = dataset_remote_with_metadata.filter_by(
    metadata_filters=[
        rg.TermsMetadataFilter(
            name="group",
            values=["group-1", "group-2"]
        ),
        rg.IntegerMetadataFilter(
            name="length",
            le=282
        ),
        rg.FloatMetadataFilter(
            name="length_std",
            ge=204,
            le=290
        ),
    ]
)

print(len(filtered_records))
242
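
To make the combination of filters more concrete, here is a minimal sketch in plain Python that does not call Argilla at all. The classes `TermsFilter` and `RangeFilter`, the function `filter_records` and the toy records are hypothetical stand-ins introduced only for illustration. The sketch assumes that a record is kept only when it satisfies every filter (a logical AND) and that the `ge` and `le` bounds are inclusive.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Sequence


@dataclass
class TermsFilter:
    """Hypothetical stand-in: keeps records whose value is one of `values`."""
    name: str
    values: Sequence[str]

    def matches(self, metadata: Dict[str, Any]) -> bool:
        return metadata.get(self.name) in self.values


@dataclass
class RangeFilter:
    """Hypothetical stand-in for an integer/float filter with inclusive bounds."""
    name: str
    ge: Optional[float] = None
    le: Optional[float] = None

    def __post_init__(self) -> None:
        # Mirrors the rule from the tutorial: at least one bound is required.
        if self.ge is None and self.le is None:
            raise ValueError("Provide at least one of `ge` or `le`.")

    def matches(self, metadata: Dict[str, Any]) -> bool:
        value = metadata.get(self.name)
        if value is None:
            return False
        if self.ge is not None and value < self.ge:
            return False
        if self.le is not None and value > self.le:
            return False
        return True


def filter_records(records: List[Dict[str, Any]], filters: list) -> List[Dict[str, Any]]:
    # Assumption: a record must satisfy every filter (logical AND).
    return [r for r in records if all(f.matches(r) for f in filters)]


toy_records = [
    {"group": "group-1", "length": 120, "length_std": 250.0},  # passes all three
    {"group": "group-3", "length": 120, "length_std": 250.0},  # wrong group
    {"group": "group-2", "length": 300, "length_std": 250.0},  # too long
    {"group": "group-2", "length": 282, "length_std": 203.9},  # std below range
    {"group": "group-2", "length": 282, "length_std": 290.0},  # passes (inclusive bounds)
]

filters = [
    TermsFilter(name="group", values=["group-1", "group-2"]),
    RangeFilter(name="length", le=282),
    RangeFilter(name="length_std", ge=204, le=290),
]

kept = filter_records(toy_records, filters)
print(len(kept))
```

Only the first and last toy records survive, because each of the others fails exactly one of the three conditions.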

By Status#

We can also filter by the status. The response_status can be of the following types:

  • missing, if the records have no responses, or draft, if they have responses that are not submitted yet. In both cases, the records appear as Pending in the UI.

  • discarded, if the responses were discarded.

  • submitted, if the responses are submitted.

filter-by-status.PNG

Now, we will use the UI to annotate and discard some samples to use as examples, and then we will start filtering.

  • First, we want to check the submitted records so we can estimate how long the annotation process is taking us. We can see that we still have a lot of work to do.

[20]:
filtered_dataset = dataset_remote_with_metadata.filter_by(response_status=["submitted"])
print('Submitted records:', len(filtered_dataset))
Submitted records: 10
  • Now, we want to check those that are pending, so we will filter by the missing and draft statuses. We can see that 973 records are still pending annotation.

[21]:
filtered_dataset = dataset_remote_with_metadata.filter_by(response_status=["missing", "draft"])
print('Pending records:', len(filtered_dataset))
Pending records: 973
  • Finally, we will check how many of the records that match the earlier metadata filters were discarded.

[28]:
filtered_dataset = dataset_remote_with_metadata.filter_by(
    metadata_filters=[
        rg.TermsMetadataFilter(
            name="group",
            values=["group-1", "group-2"]
        ),
        rg.IntegerMetadataFilter(
            name="length",
            le=282
        ),
        rg.FloatMetadataFilter(
            name="length_std",
            ge=204,
            le=290
        ),
    ],
    response_status=["discarded"]
)

print('Discarded records:', len(filtered_dataset))
Discarded records: 7
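
As a small illustration of how the response statuses relate to what the UI shows, the following plain-Python sketch groups a list of statuses by UI label. It does not call Argilla. The mapping of `missing` and `draft` to Pending comes from the description above, while the labels "Submitted" and "Discarded", the function `count_by_ui_label` and the toy statuses are assumptions made for the example.

```python
from collections import Counter

# `missing` and `draft` both show up as "Pending" in the UI.
# The "Submitted" and "Discarded" labels are assumed for this sketch.
STATUS_TO_UI_LABEL = {
    "missing": "Pending",
    "draft": "Pending",
    "discarded": "Discarded",
    "submitted": "Submitted",
}


def count_by_ui_label(statuses):
    """Count how many records fall under each UI label."""
    return Counter(STATUS_TO_UI_LABEL[status] for status in statuses)


toy_statuses = ["missing", "draft", "submitted", "discarded", "missing"]
counts = count_by_ui_label(toy_statuses)
print(counts["Pending"], counts["Submitted"], counts["Discarded"])
```

This is why the pending records are retrieved by passing both `missing` and `draft` to response_status in the same call.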

By Suggestions and Responses#

Within the UI filters, you can filter records according to the value of the responses given by the current user. The question type must be LabelQuestion, MultiLabelQuestion or RatingQuestion. If you prefer to filter records based on suggestions, it is possible to filter by suggestion score, value and agent. This option will be available in the Python SDK soon.

Sort#

We can also order the records according to one or several attributes. In the UI, this can be done easily using the Sort menu, so we will focus on how to do it in the Python SDK using sort_by. This method allows us to sort by the last update (updated_at) or by any metadata property (metadata.my-metadata-name) in ascending or descending order.

  • First, we will sort the records by most recent update and by group.

[39]:
from argilla import SortBy

sorted_records = dataset_remote_with_suggestions_and_responses.sort_by(
    [
        SortBy(field="updated_at", order="desc"),
        SortBy(field="metadata.group", order="asc")
    ]
)
  • Then, we want to order the filtered records in the same way. So, we can combine filter_by and sort_by.

[40]:
filtered_dataset = dataset_remote_with_suggestions_and_responses.filter_by(
    response_status=["discarded"]
).sort_by(
    [
        SortBy(field="updated_at", order="desc"),
        SortBy(field="metadata.group", order="asc")
    ]
)
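
To show what sorting by several attributes means in practice, here is a minimal plain-Python sketch that does not call Argilla. It assumes that the first sort criterion takes priority and that later criteria only break ties, which is how this example interprets the list passed to sort_by. The toy records and their `id` values are hypothetical.

```python
from datetime import datetime

toy_records = [
    {"id": "a", "updated_at": datetime(2023, 11, 1, 10, 0), "group": "group-2"},
    {"id": "b", "updated_at": datetime(2023, 11, 2, 9, 0), "group": "group-1"},
    {"id": "c", "updated_at": datetime(2023, 11, 1, 10, 0), "group": "group-1"},
]

# Python's sort is stable, so we sort by the secondary key first
# and then by the primary key; ties keep the secondary ordering.
by_group = sorted(toy_records, key=lambda r: r["group"])  # group ascending
ordered = sorted(by_group, key=lambda r: r["updated_at"], reverse=True)  # newest first

print([r["id"] for r in ordered])
```

Record `b` comes first because it was updated most recently, and the tie between `a` and `c` is broken by the group name in ascending order.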

Conclusion#

In this tutorial, we learned how to filter and query our records. We delved into the specifics of filtering based on field content, metadata attributes, various statuses, and suggestions and responses. Additionally, we explored methods to order our records, whether by the most recent updates or by any metadata characteristic. Finally, we saw how to use the semantic search feature.