`rg.Dataset.records`

Usage Examples

In most cases, you will not need to create a DatasetRecords object directly. Instead, you can access it via the Dataset object:

dataset.records

For users familiar with legacy approaches

The Dataset.records object is used to interact with the records in a dataset. It fetches records from the server in batches as you iterate, without keeping a local copy of the records.
The log method of Dataset.records is used to both add and update records in a dataset. If the record includes a known id field, the record will be updated. If the record does not include a known id field, the record will be added.
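The add-or-update rule can be pictured with a small, self-contained sketch. It uses a plain dictionary as a stand-in for the server, so `upsert` and `store` are hypothetical names for illustration and not part of the Argilla API. Merging new values into an existing record is an assumption of the sketch:

```python
import uuid


def upsert(store, records):
    """Add or update records in `store`, a dict keyed by record id.

    Illustrative stand-in for the add-or-update rule only.
    Returns a (created, updated) tuple of counts.
    """
    created = updated = 0
    for record in records:
        record_id = record.get("id")
        if record_id is not None and record_id in store:
            # Known id: merge the new values into the existing record (assumption).
            store[record_id].update(record)
            updated += 1
        else:
            # Missing or unknown id: store the record as a new one.
            record_id = record_id if record_id is not None else str(uuid.uuid4())
            store[record_id] = {**record, "id": record_id}
            created += 1
    return created, updated


store = {}
print(upsert(store, [{"id": "1", "question": "Do you need oxygen to breathe?"}]))
# (1, 0)
print(upsert(store, [{"id": "1", "answer": "Yes"}, {"question": "What is the boiling point of water?"}]))
# (1, 1)
```

The second call updates record "1" because its id is already known, and adds the other record because it carries no id.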

Adding records to a dataset

To add records to a dataset, use the log method. Records can be added as dictionaries or as Record objects. Single records can also be added as a dictionary or Record.

As a Record object | From a data structure | From a data structure with a mapping | From a Hugging Face dataset

You can also add records to a dataset by initializing a Record object directly.

records = [
    rg.Record(
        fields={
            "question": "Do you need oxygen to breathe?",
            "answer": "Yes"
        },
    ),
    rg.Record(
        fields={
            "question": "What is the boiling point of water?",
            "answer": "100 degrees Celsius"
        },
    ),
] # (1)

dataset.records.log(records)

This is an illustration of a definition. In a real-world scenario, you would iterate over a data structure and create a Record object for each item.

data = [
    {
        "question": "Do you need oxygen to breathe?",
        "answer": "Yes",
    },
    {
        "question": "What is the boiling point of water?",
        "answer": "100 degrees Celsius",
    },
] # (1)

dataset.records.log(data)

The data structure's keys must match the fields or questions in the Argilla dataset. In this case, there are fields named question and answer.

data = [
    {
        "query": "Do you need oxygen to breathe?",
        "response": "Yes",
    },
    {
        "query": "What is the boiling point of water?",
        "response": "100 degrees Celsius",
    },
] # (1)
dataset.records.log(
    records=data,
    mapping={"query": "question", "response": "answer"} # (2)
)

The data structure's keys must match the fields or questions in the Argilla dataset, either directly or through a mapping. In this case, the dataset has fields named question and answer.
The data structure has keys query and response and the Argilla dataset has question and answer. You can use the mapping parameter to map the keys in the data structure to the fields in the Argilla dataset.
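What the mapping parameter does to each incoming dictionary can be sketched in plain Python. This is a minimal illustration, not Argilla's implementation: `apply_mapping` is a hypothetical helper, and the extra target name `prompt` is invented for the example. It also covers the case, described in the `log` reference, where one key is assigned to several targets through a list or tuple:

```python
def apply_mapping(record, mapping):
    """Rename the keys of `record` according to `mapping`.

    A mapping value may be a single target name or a list/tuple of
    target names, in which case the value is copied to each target.
    Keys without a mapping entry are kept unchanged.
    """
    mapped = {}
    for key, value in record.items():
        targets = mapping.get(key, key)
        if isinstance(targets, str):
            targets = [targets]
        for target in targets:
            mapped[target] = value
    return mapped


row = {"query": "Do you need oxygen to breathe?", "response": "Yes"}
print(apply_mapping(row, {"query": "question", "response": "answer"}))
# {'question': 'Do you need oxygen to breathe?', 'answer': 'Yes'}
print(apply_mapping(row, {"query": ("question", "prompt")}))
# {'question': 'Do you need oxygen to breathe?', 'prompt': 'Do you need oxygen to breathe?', 'response': 'Yes'}
```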

You can also add records to a dataset using a Hugging Face dataset. This is useful when you want to use a dataset from the Hugging Face Hub and add it to your Argilla dataset.

You can add the dataset where the column names correspond to the names of fields, questions, metadata or vectors in the Argilla dataset.

If the dataset's schema does not correspond to your Argilla dataset names, you can use a mapping to indicate which columns in the dataset correspond to the Argilla dataset fields.

from datasets import load_dataset

hf_dataset = load_dataset("imdb", split="train[:100]") # (1)

dataset.records.log(records=hf_dataset)

In this example, the Hugging Face dataset matches the Argilla dataset schema. If that is not the case, you could use the .map method of the datasets library to prepare the data before adding it to the Argilla dataset.

Here we use the mapping parameter to specify the relationship between the Hugging Face dataset and the Argilla dataset.

dataset.records.log(records=hf_dataset, mapping={"txt": "text", "y": "label"}) # (1)

In this case, the txt key in the Hugging Face dataset corresponds to the text field in the Argilla dataset, and the y key in the Hugging Face dataset corresponds to the label field in the Argilla dataset.

Updating records in a dataset

Records can also be updated using the log method with records that contain an id to identify the records to be updated. As above, records can be provided as dictionaries or as Record objects.

As a Record object | From a data structure | From a data structure with a mapping | From a Hugging Face dataset

You can update records in a dataset by initializing a Record object directly and providing the id field.

records = [
    rg.Record(
        metadata={"department": "toys"},
        id="2" # (1)
    ),
]

dataset.records.log(records)

The id field is required to identify the record to be updated. The id field must be unique for each record in the dataset. If the id field is not provided, the record will be added as a new record.

You can also update records in a dataset by providing the id field in the data structure.

data = [
    {
        "metadata": {"department": "toys"},
        "id": "2" # (1)
    },
]

dataset.records.log(data)

The id field is required to identify the record to be updated. The id field must be unique for each record in the dataset. If the id field is not provided, the record will be added as a new record.

You can also update records in a dataset by providing the id field in the data structure and using a mapping to map the keys in the data structure to the fields in the dataset.

data = [
    {
        "metadata": {"department": "toys"},
        "my_id": "2" # (1)
    },
]

dataset.records.log(
    records=data,
    mapping={"my_id": "id"} # (2)
)

The id field is required to identify the record to be updated. The id field must be unique for each record in the dataset. If the id field is not provided, the record will be added as a new record.
Let's say that your data structure has keys my_id instead of id. You can use the mapping parameter to map the keys in the data structure to the fields in the dataset.

You can also update records in an Argilla dataset using a Hugging Face dataset. To update records, the Hugging Face dataset must contain an id field to identify the records to be updated, or you can use a mapping to map the keys in the Hugging Face dataset to the fields in the Argilla dataset.

from datasets import load_dataset

hf_dataset = load_dataset("imdb", split="train[:100]") # (1)

dataset.records.log(records=hf_dataset, mapping={"uuid": "id"}) # (2)

In this example, the Hugging Face dataset matches the Argilla dataset schema.
The uuid key in the Hugging Face dataset corresponds to the id field in the Argilla dataset.

Adding and updating records with images

Argilla datasets can contain image fields. You can add images to a dataset by passing the image to the record object as either a remote URL, a local path to an image file, or a PIL object. The field names must be defined as an rg.ImageField in the dataset's Settings object to be accepted. Images will be stored in the Argilla database and returned using the data URI scheme.
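For readers unfamiliar with data URIs, the following sketch shows what such a string looks like. It only illustrates the general `data:<mime type>;base64,<payload>` format using the standard library; `to_data_uri` is a hypothetical helper, and how Argilla encodes images internally is not shown on this page:

```python
import base64


def to_data_uri(image_bytes, mime_type="image/jpeg"):
    """Encode raw bytes as a base64 data URI string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# Two placeholder bytes stand in for real image content.
print(to_data_uri(b"hi", mime_type="image/png"))
# data:image/png;base64,aGk=
```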

As PIL objects

To retrieve the images as rescaled PIL objects, you can use the to_datasets method when exporting the records, as shown in this how-to guide.

From a data structure with remote URLs | From a data structure with local files or PIL objects | From a Hugging Face dataset

data = [
    {
        "image": "https://example.com/image1.jpg",
    },
    {
        "image": "https://example.com/image2.jpg",
    },
]

dataset.records.log(data)

import os
from PIL import Image

image_dir = "path/to/images"

data = [
    {
        "image": os.path.join(image_dir, "image1.jpg"), # (1)
    },
    {
        "image": Image.open(os.path.join(image_dir, "image2.jpg")), # (2)
    },
]

dataset.records.log(data)

The image is a local file path.
The image is a PIL object.

Hugging Face datasets can be passed directly to the log method. The image field must be defined as an Image in the dataset's features.

from datasets import load_dataset

hf_dataset = load_dataset("ylecun/mnist", split="train[:100]")
dataset.records.log(records=hf_dataset)

If the image field is not defined as an Image in the dataset's features, you can cast the dataset to the correct schema before adding it to the Argilla dataset. This is only necessary when the column's values are also not one of the image types that Argilla supports (URL, local path, or PIL object).

from datasets import Features, Image, Value, load_dataset

hf_dataset = load_dataset("<my_custom_dataset>") # (1)
hf_dataset = hf_dataset.cast(
    features=Features({"image": Image(), "label": Value("string")}),
)
dataset.records.log(records=hf_dataset)

In this example, the Hugging Face dataset matches the Argilla dataset schema but the image field is not defined as an Image in the dataset's features.

Iterating over records in a dataset

Dataset.records can be used to iterate over records in a dataset from the server. The records will be fetched in batches from the server:

for record in dataset.records:
    print(record)

# Fetch records with suggestions and responses
for record in dataset.records(with_suggestions=True, with_responses=True):
    print(record.suggestions)
    print(record.responses)

# Filter records by a query and fetch records with vectors
for record in dataset.records(query="capital", with_vectors=True):
    print(record.vectors)

Check out the rg.Record class reference for more information on the properties and methods available on a record and the rg.Query class reference for more information on the query syntax.
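The way `batch_size`, `start_offset` and `limit` interact during iteration can be sketched with a plain generator. This is an illustration under stated assumptions rather than the library's iterator: `iter_in_batches` and `fetch_page` are hypothetical names, and a list slice stands in for the server request:

```python
def iter_in_batches(fetch_page, batch_size=256, start_offset=0, limit=None):
    """Yield items one by one while fetching them page by page.

    `fetch_page(offset, size)` is a hypothetical callable standing in for
    a server request; it returns a list with at most `size` items.
    """
    offset = start_offset
    yielded = 0
    while limit is None or yielded < limit:
        # Never request more items than the remaining limit allows.
        size = batch_size if limit is None else min(batch_size, limit - yielded)
        page = fetch_page(offset, size)
        if not page:
            break  # nothing left to fetch
        for item in page:
            yield item
        yielded += len(page)
        offset += len(page)


server_side = list(range(10))  # pretend these records live on the server


def fetch(offset, size):
    return server_side[offset : offset + size]


print(list(iter_in_batches(fetch, batch_size=4, start_offset=2, limit=5)))
# [2, 3, 4, 5, 6]
```

Only one page is held in memory at a time, which is why iteration does not need a local copy of the whole dataset.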

`DatasetRecords`

Bases: Iterable[Record], LoggingMixin

This class is used to work with records from a dataset and is accessed via Dataset.records. The responsibility of this class is to provide an interface to interact with records in a dataset, by adding, updating, fetching, querying, deleting, and exporting records.

Attributes:

Name	Type	Description
`client`	`Argilla`	The Argilla client object.
`dataset`	`Dataset`	The dataset object.

Source code in src/argilla/records/_dataset_records.py

class DatasetRecords(Iterable[Record], LoggingMixin):
    """This class is used to work with records from a dataset and is accessed via `Dataset.records`.
    The responsibility of this class is to provide an interface to interact with records in a dataset,
    by adding, updating, fetching, querying, deleting, and exporting records.

    Attributes:
        client (Argilla): The Argilla client object.
        dataset (Dataset): The dataset object.
    """

    _api: RecordsAPI

    DEFAULT_BATCH_SIZE = 256
    DEFAULT_DELETE_BATCH_SIZE = 64

    def __init__(
        self, client: "Argilla", dataset: "Dataset", mapping: Optional[Dict[str, Union[str, Sequence[str]]]] = None
    ):
        """Initializes a DatasetRecords object with a client and a dataset.
        Args:
            client: An Argilla client object.
            dataset: A Dataset object.
        """
        self.__client = client
        self.__dataset = dataset
        self._mapping = mapping or {}
        self._api = self.__client.api.records

    def __iter__(self):
        return DatasetRecordsIterator(self.__dataset, self.__client, with_suggestions=True, with_responses=True)

    def __call__(
        self,
        query: Optional[Union[str, Query]] = None,
        batch_size: Optional[int] = DEFAULT_BATCH_SIZE,
        start_offset: int = 0,
        with_suggestions: bool = True,
        with_responses: bool = True,
        with_vectors: Optional[Union[List, bool, str]] = None,
        limit: Optional[int] = None,
    ) -> DatasetRecordsIterator:
        """Returns an iterator over the records in the dataset on the server.

        Parameters:
            query: A string or a Query object to filter the records.
            batch_size: The number of records to fetch in each batch. The default is 256.
            start_offset: The offset from which to start fetching records. The default is 0.
            with_suggestions: Whether to include suggestions in the records. The default is True.
            with_responses: Whether to include responses in the records. The default is True.
            with_vectors: A list of vector names to include in the records. The default is None.
                If a list is provided, only the specified vectors will be included.
                If True is provided, all vectors will be included.
            limit: The maximum number of records to fetch. The default is None.

        Returns:
            An iterator over the records in the dataset on the server.

        """
        if query and isinstance(query, str):
            query = Query(query=query)

        if with_vectors:
            self._validate_vector_names(vector_names=with_vectors)

        return DatasetRecordsIterator(
            dataset=self.__dataset,
            client=self.__client,
            query=query,
            batch_size=batch_size,
            start_offset=start_offset,
            with_suggestions=with_suggestions,
            with_responses=with_responses,
            with_vectors=with_vectors,
            limit=limit,
        )

    def __repr__(self) -> str:
        return f"{self.__class__.__name__}({self.__dataset})"

    ############################
    # Public methods
    ############################

    def log(
        self,
        records: Union[List[dict], List[Record], HFDataset],
        mapping: Optional[Dict[str, Union[str, Sequence[str]]]] = None,
        user_id: Optional[UUID] = None,
        batch_size: int = DEFAULT_BATCH_SIZE,
        on_error: RecordErrorHandling = RecordErrorHandling.RAISE,
    ) -> "DatasetRecords":
        """Add or update records in a dataset on the server using the provided records.
        If the record includes a known `id` field, the record will be updated.
        If the record does not include a known `id` field, the record will be added as a new record.
        See `rg.Record` for more information on the record definition.

        Parameters:
            records: A list of `Record` objects, a Hugging Face Dataset, or a list of dictionaries representing the records.
                     If records are defined as a dictionaries or a dataset, the keys/ column names should correspond to the
                     fields in the Argilla dataset's fields and questions. `id` should be provided to identify the records when updating.
            mapping: A dictionary that maps the keys/ column names in the records to the fields or questions in the Argilla dataset.
                     To assign an incoming key or column to multiple fields or questions, provide a list or tuple of field or question names.
            user_id: The user id to be associated with the records' response. If not provided, the current user id is used.
            batch_size: The number of records to send in each batch. The default is 256.

        Returns:
            A list of Record objects representing the updated records.
        """
        record_models = self._ingest_records(
            records=records, mapping=mapping, user_id=user_id or self.__client.me.id, on_error=on_error
        )
        batch_size = self._normalize_batch_size(
            batch_size=batch_size,
            records_length=len(record_models),
            max_value=self._api.MAX_RECORDS_PER_UPSERT_BULK,
        )

        created_or_updated = []
        records_updated = 0

        for batch in tqdm(
            iterable=range(0, len(records), batch_size),
            desc="Sending records...",
            total=len(records) // batch_size,
            unit="batch",
        ):
            self._log_message(message=f"Sending records from {batch} to {batch + batch_size}.")
            batch_records = record_models[batch : batch + batch_size]
            models, updated = self._api.bulk_upsert(dataset_id=self.__dataset.id, records=batch_records)
            created_or_updated.extend([Record.from_model(model=model, dataset=self.__dataset) for model in models])
            records_updated += updated

        records_created = len(created_or_updated) - records_updated
        self._log_message(
            message=f"Updated {records_updated} records and added {records_created} records to dataset {self.__dataset.name}",
            level="info",
        )

        return self

    def delete(
        self,
        records: List[Record],
        batch_size: int = DEFAULT_DELETE_BATCH_SIZE,
    ) -> List[Record]:
        """Delete records in a dataset on the server using the provided records
            and matching based on the id.

        Parameters:
            records: A list of `Record` objects representing the records to be deleted.
            batch_size: The number of records to send in each batch. The default is 64.

        Returns:
            A list of Record objects representing the deleted records.

        """
        mapping = None
        user_id = self.__client.me.id
        record_models = self._ingest_records(records=records, mapping=mapping, user_id=user_id)
        batch_size = self._normalize_batch_size(
            batch_size=batch_size,
            records_length=len(record_models),
            max_value=self._api.MAX_RECORDS_PER_DELETE_BULK,
        )

        records_deleted = 0
        for batch in tqdm(
            iterable=range(0, len(records), batch_size),
            desc="Sending records...",
            total=len(records) // batch_size,
            unit="batch",
        ):
            self._log_message(message=f"Sending records from {batch} to {batch + batch_size}.")
            batch_records = record_models[batch : batch + batch_size]
            self._api.delete_many(dataset_id=self.__dataset.id, records=batch_records)
            records_deleted += len(batch_records)

        self._log_message(
            message=f"Deleted {len(record_models)} records from dataset {self.__dataset.name}",
            level="info",
        )

        return records

    def to_dict(self, flatten: bool = False, orient: str = "names") -> Dict[str, Any]:
        """
        Return the records as a dictionary. This is a convenient shortcut for dataset.records(...).to_dict().

        Parameters:
            flatten (bool): The structure of the exported dictionary.
                - True: The record fields, metadata, suggestions and responses will be flattened.
                - False: The record fields, metadata, suggestions and responses will be nested.
            orient (str): The orientation of the exported dictionary.
                - "names": The keys of the dictionary will be the names of the fields, metadata, suggestions and responses.
                - "index": The keys of the dictionary will be the id of the records.
        Returns:
            A dictionary of records.

        """
        return self().to_dict(flatten=flatten, orient=orient)

    def to_list(self, flatten: bool = False) -> List[Dict[str, Any]]:
        """
        Return the records as a list of dictionaries. This is a convenient shortcut for dataset.records(...).to_list().

        Parameters:
            flatten (bool): The structure of the exported dictionaries in the list.
                - True: The record keys are flattened and a dot notation is used to record attributes and their attributes . For example, `label.suggestion` and `label.response`. Records responses are spread across multiple columns for values and users.
                - False: The record fields, metadata, suggestions and responses will be nested dictionary with keys for record attributes.
        Returns:
            A list of dictionaries of records.
        """
        data = self().to_list(flatten=flatten)
        return data

    def to_json(self, path: Union[Path, str]) -> Path:
        """
        Export the records to a file on disk.

        Parameters:
            path (str): The path to the file to save the records.

        Returns:
            The path to the file where the records were saved.

        """
        return self().to_json(path=path)

    def from_json(self, path: Union[Path, str]) -> List[Record]:
        """Creates a DatasetRecords object from a disk path to a JSON file.
            The JSON file should be defined by `DatasetRecords.to_json`.

        Args:
            path (str): The path to the file containing the records.

        Returns:
            DatasetRecords: The DatasetRecords object created from the disk path.

        """
        records = JsonIO._records_from_json(path=path)
        return self.log(records=records)

    def to_datasets(self) -> HFDataset:
        """
        Export the records to a HFDataset.

        Returns:
            The dataset containing the records.

        """

        return self().to_datasets()

    ############################
    # Private methods
    ############################

    def _ingest_records(
        self,
        records: Union[List[Dict[str, Any]], List[Record], HFDataset],
        mapping: Optional[Dict[str, Union[str, Sequence[str]]]] = None,
        user_id: Optional[UUID] = None,
        on_error: RecordErrorHandling = RecordErrorHandling.RAISE,
    ) -> List[RecordModel]:
        """Ingests records from a list of dictionaries, a Hugging Face Dataset, or a list of Record objects."""

        mapping = mapping or self._mapping
        if len(records) == 0:
            raise ValueError("No records provided to ingest.")

        record_mapper = IngestedRecordMapper(mapping=mapping, dataset=self.__dataset, user_id=user_id)

        if HFDatasetsIO._is_hf_dataset(dataset=records):
            records = HFDatasetsIO._record_dicts_from_datasets(hf_dataset=records, mapper=record_mapper)

        ingested_records = []
        for record in records:
            try:
                if isinstance(record, dict):
                    record = record_mapper(data=record)
                elif isinstance(record, Record):
                    record.dataset = self.__dataset
                else:
                    raise ValueError(
                        "Records should be a a list Record instances, "
                        "a Hugging Face Dataset, or a list of dictionaries representing the records."
                        f"Found a record of type {type(record)}: {record}."
                    )
            except Exception as e:
                if on_error == RecordErrorHandling.IGNORE:
                    self._log_message(
                        message=f"Failed to ingest record from dict {record}: {e}",
                        level="info",
                    )
                    continue
                elif on_error == RecordErrorHandling.WARN:
                    warnings.warn(f"Failed to ingest record from dict {record}: {e}")
                    continue
                raise RecordsIngestionError(f"Failed to ingest record from dict {record}") from e
            ingested_records.append(record.api_model())
        return ingested_records

    def _normalize_batch_size(self, batch_size: int, records_length, max_value: int):
        norm_batch_size = min(batch_size, records_length, max_value)

        if batch_size != norm_batch_size:
            self._log_message(
                message=f"The provided batch size {batch_size} was normalized. Using value {norm_batch_size}.",
                level="warning",
            )

        return norm_batch_size

    def _validate_vector_names(self, vector_names: Union[List[str], str]) -> None:
        if not isinstance(vector_names, list):
            vector_names = [vector_names]
        for vector_name in vector_names:
            if isinstance(vector_name, bool):
                continue
            if vector_name not in self.__dataset.schema:
                raise ValueError(f"Vector field {vector_name} not found in dataset schema.")
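The three error-handling branches in `_ingest_records` above can be reduced to the following sketch. It is a simplified stand-in, not the library code: `ErrorHandling`, its string values and `ingest` are hypothetical, and a plain `ValueError` replaces `RecordsIngestionError`:

```python
import warnings
from enum import Enum


class ErrorHandling(Enum):  # hypothetical stand-in for RecordErrorHandling
    RAISE = "raise"
    WARN = "warn"
    IGNORE = "ignore"


def ingest(rows, convert, on_error=ErrorHandling.RAISE):
    """Convert each row, applying the chosen policy to rows that fail."""
    ingested = []
    for row in rows:
        try:
            ingested.append(convert(row))
        except Exception as exc:
            if on_error is ErrorHandling.IGNORE:
                continue  # skip the row (the source also logs an info message)
            if on_error is ErrorHandling.WARN:
                warnings.warn(f"Failed to ingest {row!r}: {exc}")
                continue  # skip the row after emitting a warning
            # Default policy: stop at the first failing row.
            raise ValueError(f"Failed to ingest {row!r}") from exc
    return ingested


rows = ["1", "two", "3"]
print(ingest(rows, int, on_error=ErrorHandling.IGNORE))
# [1, 3]
```

With the default policy, a single bad row aborts the whole call before anything is sent, while the other two policies drop the bad rows and continue.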

`__init__(client, dataset, mapping=None)`

Initializes a DatasetRecords object with a client and a dataset.

Args:

client: An Argilla client object.
dataset: A Dataset object.

Source code in src/argilla/records/_dataset_records.py

def __init__(
    self, client: "Argilla", dataset: "Dataset", mapping: Optional[Dict[str, Union[str, Sequence[str]]]] = None
):
    """Initializes a DatasetRecords object with a client and a dataset.
    Args:
        client: An Argilla client object.
        dataset: A Dataset object.
    """
    self.__client = client
    self.__dataset = dataset
    self._mapping = mapping or {}
    self._api = self.__client.api.records

`__call__(query=None, batch_size=DEFAULT_BATCH_SIZE, start_offset=0, with_suggestions=True, with_responses=True, with_vectors=None, limit=None)`

Returns an iterator over the records in the dataset on the server.

Parameters:

Name	Type	Description	Default
`query`	`Optional[Union[str, Query]]`	A string or a Query object to filter the records.	`None`
`batch_size`	`Optional[int]`	The number of records to fetch in each batch. The default is 256.	`DEFAULT_BATCH_SIZE`
`start_offset`	`int`	The offset from which to start fetching records. The default is 0.	`0`
`with_suggestions`	`bool`	Whether to include suggestions in the records. The default is True.	`True`
`with_responses`	`bool`	Whether to include responses in the records. The default is True.	`True`
`with_vectors`	`Optional[Union[List, bool, str]]`	A list of vector names to include in the records. The default is None. If a list is provided, only the specified vectors will be included. If True is provided, all vectors will be included.	`None`
`limit`	`Optional[int]`	The maximum number of records to fetch. The default is None.	`None`

Returns:

Type	Description
`DatasetRecordsIterator`	An iterator over the records in the dataset on the server.

Source code in src/argilla/records/_dataset_records.py

def __call__(
    self,
    query: Optional[Union[str, Query]] = None,
    batch_size: Optional[int] = DEFAULT_BATCH_SIZE,
    start_offset: int = 0,
    with_suggestions: bool = True,
    with_responses: bool = True,
    with_vectors: Optional[Union[List, bool, str]] = None,
    limit: Optional[int] = None,
) -> DatasetRecordsIterator:
    """Returns an iterator over the records in the dataset on the server.

    Parameters:
        query: A string or a Query object to filter the records.
        batch_size: The number of records to fetch in each batch. The default is 256.
        start_offset: The offset from which to start fetching records. The default is 0.
        with_suggestions: Whether to include suggestions in the records. The default is True.
        with_responses: Whether to include responses in the records. The default is True.
        with_vectors: A list of vector names to include in the records. The default is None.
            If a list is provided, only the specified vectors will be included.
            If True is provided, all vectors will be included.
        limit: The maximum number of records to fetch. The default is None.

    Returns:
        An iterator over the records in the dataset on the server.

    """
    if query and isinstance(query, str):
        query = Query(query=query)

    if with_vectors:
        self._validate_vector_names(vector_names=with_vectors)

    return DatasetRecordsIterator(
        dataset=self.__dataset,
        client=self.__client,
        query=query,
        batch_size=batch_size,
        start_offset=start_offset,
        with_suggestions=with_suggestions,
        with_responses=with_responses,
        with_vectors=with_vectors,
        limit=limit,
    )

`log(records, mapping=None, user_id=None, batch_size=DEFAULT_BATCH_SIZE, on_error=RecordErrorHandling.RAISE)`

Add or update records in a dataset on the server using the provided records. If the record includes a known id field, the record will be updated. If the record does not include a known id field, the record will be added as a new record. See rg.Record for more information on the record definition.

Parameters:

Name	Type	Description	Default
`records`	`Union[List[dict], List[Record], HFDataset]`	A list of `Record` objects, a Hugging Face Dataset, or a list of dictionaries representing the records. If records are defined as dictionaries or a dataset, the keys or column names should correspond to the Argilla dataset's fields and questions. `id` should be provided to identify the records when updating.	required
`mapping`	`Optional[Dict[str, Union[str, Sequence[str]]]]`	A dictionary that maps the keys or column names in the records to the fields or questions in the Argilla dataset. To assign an incoming key or column to multiple fields or questions, provide a list or tuple of field or question names.	`None`
`user_id`	`Optional[UUID]`	The user id to be associated with the records' response. If not provided, the current user id is used.	`None`
`batch_size`	`int`	The number of records to send in each batch. The default is 256.	`DEFAULT_BATCH_SIZE`
`on_error`	`RecordErrorHandling`	How to handle a record that fails to be ingested. With RAISE, a RecordsIngestionError is raised. With WARN, a warning is emitted and the record is skipped. With IGNORE, the failure is logged and the record is skipped.	`RecordErrorHandling.RAISE`

Returns:

Type	Description
`DatasetRecords`	The same DatasetRecords object (the method returns self).

Source code in src/argilla/records/_dataset_records.py

def log(
    self,
    records: Union[List[dict], List[Record], HFDataset],
    mapping: Optional[Dict[str, Union[str, Sequence[str]]]] = None,
    user_id: Optional[UUID] = None,
    batch_size: int = DEFAULT_BATCH_SIZE,
    on_error: RecordErrorHandling = RecordErrorHandling.RAISE,
) -> "DatasetRecords":
    """Add or update records in a dataset on the server using the provided records.
    If the record includes a known `id` field, the record will be updated.
    If the record does not include a known `id` field, the record will be added as a new record.
    See `rg.Record` for more information on the record definition.

    Parameters:
        records: A list of `Record` objects, a Hugging Face Dataset, or a list of dictionaries representing the records.
                 If records are defined as a dictionaries or a dataset, the keys/ column names should correspond to the
                 fields in the Argilla dataset's fields and questions. `id` should be provided to identify the records when updating.
        mapping: A dictionary that maps the keys/ column names in the records to the fields or questions in the Argilla dataset.
                 To assign an incoming key or column to multiple fields or questions, provide a list or tuple of field or question names.
        user_id: The user id to be associated with the records' response. If not provided, the current user id is used.
        batch_size: The number of records to send in each batch. The default is 256.

    Returns:
        A list of Record objects representing the updated records.
    """
    record_models = self._ingest_records(
        records=records, mapping=mapping, user_id=user_id or self.__client.me.id, on_error=on_error
    )
    batch_size = self._normalize_batch_size(
        batch_size=batch_size,
        records_length=len(record_models),
        max_value=self._api.MAX_RECORDS_PER_UPSERT_BULK,
    )

    created_or_updated = []
    records_updated = 0

    for batch in tqdm(
        iterable=range(0, len(records), batch_size),
        desc="Sending records...",
        total=len(records) // batch_size,
        unit="batch",
    ):
        self._log_message(message=f"Sending records from {batch} to {batch + batch_size}.")
        batch_records = record_models[batch : batch + batch_size]
        models, updated = self._api.bulk_upsert(dataset_id=self.__dataset.id, records=batch_records)
        created_or_updated.extend([Record.from_model(model=model, dataset=self.__dataset) for model in models])
        records_updated += updated

    records_created = len(created_or_updated) - records_updated
    self._log_message(
        message=f"Updated {records_updated} records and added {records_created} records to dataset {self.__dataset.name}",
        level="info",
    )

    return self
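The batching in `log` follows a common pattern: clamp the batch size, then slice the records in steps of that size. The sketch below reproduces the pattern in isolation. `normalize_batch_size` mirrors the `min(...)` call in `_normalize_batch_size`, while `iter_batches` and the limit `MAX_PER_REQUEST = 4` are hypothetical and chosen only to keep the example small:

```python
import math


def normalize_batch_size(batch_size, records_length, max_value):
    """Clamp the batch size to the number of records and the server limit."""
    return min(batch_size, records_length, max_value)


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start : start + batch_size]


records = list(range(10))
MAX_PER_REQUEST = 4  # hypothetical server limit, for illustration only
size = normalize_batch_size(batch_size=256, records_length=len(records), max_value=MAX_PER_REQUEST)
batches = list(iter_batches(records, size))
print(size, [len(b) for b in batches])
# 4 [4, 4, 2]
print(len(records) // size, math.ceil(len(records) / size))
# 2 3
```

Note that `range(0, n, size)` produces `ceil(n / size)` batches, whereas `n // size` rounds down, so the progress-bar total in the source can be one lower than the number of batches actually sent when the last batch is partial.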

`delete(records, batch_size=DEFAULT_DELETE_BATCH_SIZE)`

Delete records in a dataset on the server using the provided records and matching based on the id.

Parameters:

Name	Type	Description	Default
`records`	`List[Record]`	A list of `Record` objects representing the records to be deleted.	required
`batch_size`	`int`	The number of records to send in each batch. The default is 64.	`DEFAULT_DELETE_BATCH_SIZE`

Returns:

Type	Description
`List[Record]`	A list of Record objects representing the deleted records.

Source code in src/argilla/records/_dataset_records.py

def delete(
    self,
    records: List[Record],
    batch_size: int = DEFAULT_DELETE_BATCH_SIZE,
) -> List[Record]:
    """Delete records in a dataset on the server using the provided records
        and matching based on the id.

    Parameters:
        records: A list of `Record` objects representing the records to be deleted.
        batch_size: The number of records to send in each batch. The default is 64.

    Returns:
        A list of Record objects representing the deleted records.

    """
    mapping = None
    user_id = self.__client.me.id
    record_models = self._ingest_records(records=records, mapping=mapping, user_id=user_id)
    batch_size = self._normalize_batch_size(
        batch_size=batch_size,
        records_length=len(record_models),
        max_value=self._api.MAX_RECORDS_PER_DELETE_BULK,
    )

    records_deleted = 0
    for batch in tqdm(
        iterable=range(0, len(records), batch_size),
        desc="Sending records...",
        total=len(records) // batch_size,
        unit="batch",
    ):
        self._log_message(message=f"Sending records from {batch} to {batch + batch_size}.")
        batch_records = record_models[batch : batch + batch_size]
        self._api.delete_many(dataset_id=self.__dataset.id, records=batch_records)
        records_deleted += len(batch_records)

    self._log_message(
        message=f"Deleted {len(record_models)} records from dataset {self.__dataset.name}",
        level="info",
    )

    return records
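
As a usage sketch, the helper below collects the records that match a condition and passes them to `delete` in one call. It assumes `dataset` behaves like an `rg.Dataset`, that calling `dataset.records()` yields records, and that `delete` returns the records it was given, as in the listing above. The name `delete_where` and the `predicate` argument are hypothetical and only serve the illustration.

```python
from typing import Any, Callable, List


def delete_where(dataset: Any, predicate: Callable[[Any], bool], batch_size: int = 64) -> List[Any]:
    """Delete every record for which `predicate(record)` is true (hypothetical helper).

    `dataset` is assumed to expose `dataset.records()` for iteration and
    `dataset.records.delete(records=..., batch_size=...)` for deletion.
    """
    # Materialize the matches first so nothing is deleted while still iterating.
    to_delete = [record for record in dataset.records() if predicate(record)]
    if not to_delete:
        # Nothing matched, so no request is sent.
        return []
    return dataset.records.delete(records=to_delete, batch_size=batch_size)
```

With a real dataset, a call could look like `delete_where(dataset, lambda record: record.fields["answer"] == "")`, assuming the dataset has a field named `answer`.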

`to_dict(flatten=False, orient='names')`

Return the records as a dictionary. This is a convenient shortcut for dataset.records(...).to_dict().

Parameters:

Name	Type	Description	Default
`flatten`	`bool`	The structure of the exported dictionary. If `True`, the record fields, metadata, suggestions and responses are flattened. If `False`, they are nested.	`False`
`orient`	`str`	The orientation of the exported dictionary. With `"names"`, the keys of the dictionary are the names of the fields, metadata, suggestions and responses. With `"index"`, the keys of the dictionary are the ids of the records.	`'names'`

Returns: A dictionary of records.

Source code in src/argilla/records/_dataset_records.py

def to_dict(self, flatten: bool = False, orient: str = "names") -> Dict[str, Any]:
    """
    Return the records as a dictionary. This is a convenient shortcut for dataset.records(...).to_dict().

    Parameters:
        flatten (bool): The structure of the exported dictionary.
            - True: The record fields, metadata, suggestions and responses will be flattened.
            - False: The record fields, metadata, suggestions and responses will be nested.
        orient (str): The orientation of the exported dictionary.
            - "names": The keys of the dictionary will be the names of the fields, metadata, suggestions and responses.
            - "index": The keys of the dictionary will be the id of the records.
    Returns:
        A dictionary of records.

    """
    return self().to_dict(flatten=flatten, orient=orient)
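
The two `orient` values can be pictured with plain dictionaries. The sketch below is illustrative only and is not Argilla's implementation: it assumes that `"names"` produces one list of values per key and that `"index"` produces one entry per record id. The function `orient_rows` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List


def orient_rows(rows: List[Dict[str, Any]], orient: str = "names") -> Dict[Any, Any]:
    """Pivot a list of per-record dictionaries (illustrative, not Argilla's code)."""
    if orient == "names":
        # One key per attribute name, each holding the values of all records.
        columns: Dict[str, List[Any]] = defaultdict(list)
        for row in rows:
            for key, value in row.items():
                columns[key].append(value)
        return dict(columns)
    if orient == "index":
        # One key per record id, each holding that record's dictionary.
        return {row["id"]: row for row in rows}
    raise ValueError(f"Unsupported orient: {orient!r}")


rows = [
    {"id": "a", "question": "Do you need oxygen to breathe?", "answer": "Yes"},
    {"id": "b", "question": "What is the boiling point of water?", "answer": "100 degrees Celsius"},
]
by_name = orient_rows(rows, orient="names")
by_index = orient_rows(rows, orient="index")
```

The `"names"` form is convenient for building column-oriented structures such as a data frame, while the `"index"` form suits lookups by record id.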

`to_list(flatten=False)`

Return the records as a list of dictionaries. This is a convenient shortcut for dataset.records(...).to_list().

Parameters:

Name	Type	Description	Default
`flatten`	`bool`	The structure of the exported dictionaries in the list. If `True`, the record keys are flattened and dot notation is used for record attributes and their sub-attributes, for example `label.suggestion` and `label.response`; record responses are spread across multiple columns for values and users. If `False`, the record fields, metadata, suggestions and responses are nested dictionaries keyed by record attribute.	`False`

Returns: A list of dictionaries of records.

Source code in src/argilla/records/_dataset_records.py

def to_list(self, flatten: bool = False) -> List[Dict[str, Any]]:
    """
    Return the records as a list of dictionaries. This is a convenient shortcut for dataset.records(...).to_list().

    Parameters:
        flatten (bool): The structure of the exported dictionaries in the list.
            - True: The record keys are flattened and dot notation is used for record attributes and their sub-attributes. For example, `label.suggestion` and `label.response`. Record responses are spread across multiple columns for values and users.
            - False: The record fields, metadata, suggestions and responses will be nested dictionaries with keys for record attributes.
    Returns:
        A list of dictionaries of records.
    """
    data = self().to_list(flatten=flatten)
    return data
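
To make the dot notation concrete, here is a minimal sketch of flattening a nested dictionary. It is not Argilla's implementation, and the nested input shape is invented for the illustration, so the exact keys that `to_list(flatten=True)` produces may differ. The function `flatten_record` is hypothetical.

```python
from typing import Any, Dict


def flatten_record(record: Dict[str, Any], parent_key: str = "", sep: str = ".") -> Dict[str, Any]:
    """Flatten nested dictionaries into dot-separated keys (illustrative helper)."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict) and value:
            # Recurse into non-empty dictionaries and prefix their keys.
            flat.update(flatten_record(value, parent_key=full_key, sep=sep))
        else:
            # Scalars, lists and empty dictionaries are kept as values.
            flat[full_key] = value
    return flat


nested = {
    "id": "a",
    "fields": {"question": "Do you need oxygen to breathe?"},
    "label": {"suggestion": "yes", "response": "yes"},
}
flat = flatten_record(nested)
```

A flat layout of this kind maps directly onto table columns, which is why it is often the more practical choice for CSV or data frame exports.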

`to_json(path)`

Export the records to a file on disk.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to the file to save the records.	required

Returns:

Type	Description
`Path`	The path to the file where the records were saved.

Source code in src/argilla/records/_dataset_records.py

def to_json(self, path: Union[Path, str]) -> Path:
    """
    Export the records to a file on disk.

    Parameters:
        path (str): The path to the file to save the records.

    Returns:
        The path to the file where the records were saved.

    """
    return self().to_json(path=path)

`from_json(path)`

Load records from a JSON file on disk and log them to the dataset. The JSON file should have been produced by `DatasetRecords.to_json`.

Parameters:

Name	Type	Description	Default
`path`	`str`	The path to the file containing the records.	required

Returns:

Name	Type	Description
`DatasetRecords`	`List[Record]`	The DatasetRecords object created from the disk path.

Source code in src/argilla/records/_dataset_records.py

def from_json(self, path: Union[Path, str]) -> List[Record]:
    """Creates a DatasetRecords object from a disk path to a JSON file.
        The JSON file should be defined by `DatasetRecords.to_json`.

    Args:
        path (str): The path to the file containing the records.

    Returns:
        DatasetRecords: The DatasetRecords object created from the disk path.

    """
    records = JsonIO._records_from_json(path=path)
    return self.log(records=records)
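
Because `from_json` logs whatever it reads, `to_json` and `from_json` can be paired to move records between datasets. The sketch below assumes `source` and `target` both behave like `rg.Dataset` objects and that the target's fields and questions are compatible with the exported records. The helper `copy_records_via_json` is hypothetical.

```python
from pathlib import Path
from typing import Any, Union


def copy_records_via_json(source: Any, target: Any, path: Union[Path, str]) -> Path:
    """Export records from `source` and log them into `target` (hypothetical helper)."""
    # `to_json` returns the path it wrote to; wrap it in Path for a uniform type.
    saved_path = Path(source.records.to_json(path=path))
    # `from_json` reads the file and logs its records to the target dataset.
    target.records.from_json(path=saved_path)
    return saved_path
```

With real datasets, a call could look like `copy_records_via_json(dataset, other_dataset, "records.json")`, where `other_dataset` is an assumed second dataset with matching settings.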

`to_datasets()`

Export the records to a Hugging Face dataset (`HFDataset`).

Returns:

Type	Description
`HFDataset`	The dataset containing the records.

Source code in src/argilla/records/_dataset_records.py

def to_datasets(self) -> HFDataset:
    """
    Export the records to a HFDataset.

    Returns:
        The dataset containing the records.

    """

    return self().to_datasets()
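
A short sketch of inspecting the export follows. It assumes `dataset` behaves like an `rg.Dataset` and that the returned object supports `len()` and a `column_names` attribute, as Hugging Face `datasets.Dataset` objects do. The helper `summarize_export` is hypothetical.

```python
from typing import Any, Dict


def summarize_export(dataset: Any) -> Dict[str, Any]:
    """Return the row count and column names of the exported dataset (hypothetical helper)."""
    hf_dataset = dataset.records.to_datasets()
    return {
        "num_rows": len(hf_dataset),
        "columns": list(hf_dataset.column_names),
    }
```

Checking the column names after an export is a quick way to see how fields, suggestions and responses were laid out before passing the data on to a training pipeline.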
