rg.Dataset.records
¶
Usage Examples¶
In most cases, you will not need to create a DatasetRecords object directly. Instead, you can access it via the Dataset object, as dataset.records.
For users familiar with legacy approaches:
- The Dataset.records object is used to interact with the records in a dataset. It interactively fetches records from the server in batches without using a local copy of the records.
- The log method of Dataset.records is used to both add and update records in a dataset. If the record includes a known id field, the record will be updated. If the record does not include a known id field, the record will be added.
Adding records to a dataset¶
To add records to a dataset, use the log method. Records can be added as dictionaries or as Record objects. Single records can also be added as a dictionary or Record.
You can also add records to a dataset by initializing a Record object directly.
records = [
    rg.Record(
        fields={
            "question": "Do you need oxygen to breathe?",
            "answer": "Yes",
        },
    ),
    rg.Record(
        fields={
            "question": "What is the boiling point of water?",
            "answer": "100 degrees Celsius",
        },
    ),
]  # (1)

dataset.records.log(records)
- This is an illustrative definition. In a real-world scenario, you would iterate over a data structure and create a Record object for each item.
data = [
    {
        "question": "Do you need oxygen to breathe?",
        "answer": "Yes",
    },
    {
        "question": "What is the boiling point of water?",
        "answer": "100 degrees Celsius",
    },
]  # (1)

dataset.records.log(data)
- The data structure's keys must match the fields or questions in the Argilla dataset. In this case, there are fields named question and answer.
data = [
    {
        "query": "Do you need oxygen to breathe?",
        "response": "Yes",
    },
    {
        "query": "What is the boiling point of water?",
        "response": "100 degrees Celsius",
    },
]  # (1)

dataset.records.log(
    records=data,
    mapping={"query": "question", "response": "answer"},  # (2)
)
- The data structure's keys must match the fields or questions in the Argilla dataset. In this case, there are fields named question and answer.
- The data structure has keys query and response, and the Argilla dataset has question and answer. You can use the mapping parameter to map the keys in the data structure to the fields in the Argilla dataset.
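To make the effect of mapping concrete, here is a minimal pure-Python sketch of the renaming it describes. The helper apply_mapping is hypothetical and is not part of Argilla; it only imitates the idea that each incoming key is renamed to its target, and that a list or tuple of targets copies the value to several names. Argilla's own handling may differ in detail.

```python
from typing import Any, Dict, List, Sequence, Union

Mapping = Dict[str, Union[str, Sequence[str]]]


def apply_mapping(rows: List[Dict[str, Any]], mapping: Mapping) -> List[Dict[str, Any]]:
    """Rename the keys of each row according to `mapping` (hypothetical helper)."""
    renamed_rows = []
    for row in rows:
        renamed: Dict[str, Any] = {}
        for key, value in row.items():
            targets = mapping.get(key, key)  # unmapped keys keep their name
            if isinstance(targets, str):
                targets = [targets]
            for target in targets:  # one incoming key may feed several names
                renamed[target] = value
        renamed_rows.append(renamed)
    return renamed_rows


data = [
    {"query": "Do you need oxygen to breathe?", "response": "Yes"},
    {"query": "What is the boiling point of water?", "response": "100 degrees Celsius"},
]
mapped = apply_mapping(data, {"query": "question", "response": "answer"})
print(mapped[0])
```

Renaming the keys yourself in this way before calling log should be equivalent in spirit to passing mapping, which can be handy when the same cleaned data is reused elsewhere.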
You can also add records to a dataset using a Hugging Face dataset. This is useful when you want to use a dataset from the Hugging Face Hub and add it to your Argilla dataset.
You can add the dataset where the column names correspond to the names of fields, questions, metadata or vectors in the Argilla dataset.
If the dataset's schema does not correspond to your Argilla dataset names, you can use a mapping to indicate which columns in the dataset correspond to the Argilla dataset fields.
from datasets import load_dataset
hf_dataset = load_dataset("imdb", split="train[:100]") # (1)
dataset.records.log(records=hf_dataset)
- In this example, the Hugging Face dataset matches the Argilla dataset schema. If that is not the case, you could use the .map method of the datasets library to prepare the data before adding it to the Argilla dataset.
Here we use the mapping parameter to specify the relationship between the Hugging Face dataset and the Argilla dataset, that is, mapping={"txt": "text", "y": "label"}.
- In this case, the txt key in the Hugging Face dataset corresponds to the text field in the Argilla dataset, and the y key in the Hugging Face dataset corresponds to the label field in the Argilla dataset.
Updating records in a dataset¶
Records can also be updated using the log method with records that contain an id to identify the records to be updated. As above, records can be added as dictionaries or as Record objects.
You can update records in a dataset by initializing a Record object directly and providing the id field.
records = [
    rg.Record(
        metadata={"department": "toys"},
        id="2",  # (1)
    ),
]

dataset.records.log(records)
- The id field is required to identify the record to be updated. The id field must be unique for each record in the dataset. If the id field is not provided, the record will be added as a new record.
You can also update records in a dataset by providing the id field in the data structure.
- The id field is required to identify the record to be updated. The id field must be unique for each record in the dataset. If the id field is not provided, the record will be added as a new record.
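The add-or-update rule described here can be pictured with a small in-memory sketch. The function log_records and the dictionary store are hypothetical stand-ins for the server, not Argilla code, and the choice to merge new values into an existing record (rather than replace it) is an assumption made only for illustration.

```python
from typing import Any, Dict, List
from uuid import uuid4


def log_records(store: Dict[str, Dict[str, Any]], rows: List[Dict[str, Any]]) -> None:
    """Add or update rows in `store`, keyed by id (hypothetical in-memory stand-in)."""
    for row in rows:
        record_id = row.get("id")
        if record_id is not None and record_id in store:
            # Known id: merge the new values into the existing record.
            store[record_id].update(row)
        else:
            # Missing or unknown id: the row becomes a new record.
            record_id = record_id if record_id is not None else str(uuid4())
            store[record_id] = {**row, "id": record_id}


store: Dict[str, Dict[str, Any]] = {}
log_records(store, [{"id": "2", "question": "Do you need oxygen to breathe?"}])
log_records(store, [{"id": "2", "metadata": {"department": "toys"}}])  # updates record "2"
log_records(store, [{"question": "What is the boiling point of water?"}])  # adds a record
print(len(store), store["2"]["metadata"])
```

The point of the sketch is only the branching on a known id: the second call changes an existing record, while the third call, which carries no id, creates a new one.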
You can also update records in a dataset by providing the id field in the data structure and using a mapping to map the keys in the data structure to the fields in the dataset.
data = [
    {
        "metadata": {"department": "toys"},
        "my_id": "2",  # (1)
    },
]

dataset.records.log(
    records=data,
    mapping={"my_id": "id"},  # (2)
)
- The id field is required to identify the record to be updated. The id field must be unique for each record in the dataset. If the id field is not provided, the record will be added as a new record.
- Let's say that your data structure has keys named my_id instead of id. You can use the mapping parameter to map the keys in the data structure to the fields in the dataset.
You can also update records in an Argilla dataset using a Hugging Face dataset. To update records, the Hugging Face dataset must contain an id field to identify the records to be updated, or you can use a mapping to map the keys in the Hugging Face dataset to the fields in the Argilla dataset.
from datasets import load_dataset
hf_dataset = load_dataset("imdb", split="train[:100]") # (1)
dataset.records.log(records=hf_dataset, mapping={"uuid": "id"}) # (2)
- In this example, the Hugging Face dataset matches the Argilla dataset schema.
- The uuid key in the Hugging Face dataset corresponds to the id field in the Argilla dataset.
Adding and updating records with images¶
Argilla datasets can contain image fields. You can add images to a dataset by passing the image to the record object as a remote URL, a local path to an image file, or a PIL object. The field names must be defined as an rg.ImageField in the dataset's Settings object to be accepted. Images are stored in the Argilla database and returned using the data URI scheme.
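For readers unfamiliar with data URIs, the following standard-library sketch shows what such a string looks like and how to decode it. The helpers to_data_uri and from_data_uri are hypothetical; the text does not state which MIME type or encoding Argilla uses, so base64 and image/png are assumptions here.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_uri(uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    header, _, payload = uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


png_signature = b"\x89PNG\r\n\x1a\n"  # first eight bytes of any PNG file
uri = to_data_uri(png_signature)
print(uri)
```

Decoding the payload this way gives back the original bytes, which could then be handed to an image library if a PIL object is needed.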
As PIL objects
To retrieve the images as rescaled PIL objects, you can use the to_datasets method when exporting the records, as shown in this how-to guide.
import os

from PIL import Image

image_dir = "path/to/images"

data = [
    {
        "image": os.path.join(image_dir, "image1.jpg"),  # (1)
    },
    {
        "image": Image.open(os.path.join(image_dir, "image2.jpg")),  # (2)
    },
]

dataset.records.log(data)
- The image is a local file path.
- The image is a PIL object.
Hugging Face datasets can be passed directly to the log method. The image field must be defined as an Image in the dataset's features.
from datasets import load_dataset

hf_dataset = load_dataset("ylecun/mnist", split="train[:100]")
dataset.records.log(records=hf_dataset)
If the image field is not defined as an Image in the dataset's features, you can cast the dataset to the correct schema before adding it to the Argilla dataset. This is only necessary if the image field is also not one of the image types supported by Argilla (URL, local path, or PIL object).
from datasets import Features, Image, Value, load_dataset

hf_dataset = load_dataset("<my_custom_dataset>")  # (1)
hf_dataset = hf_dataset.cast(
    features=Features({"image": Image(), "label": Value("string")}),
)
dataset.records.log(records=hf_dataset)
- In this example, the Hugging Face dataset matches the Argilla dataset schema, but the image field is not defined as an Image in the dataset's features.
Iterating over records in a dataset¶
Dataset.records can be used to iterate over records in a dataset from the server. The records will be fetched in batches from the server:
for record in dataset.records:
    print(record)

# Fetch records with suggestions and responses
for record in dataset.records(with_suggestions=True, with_responses=True):
    print(record.suggestions)
    print(record.responses)

# Filter records by a query and fetch records with vectors
for record in dataset.records(query="capital", with_vectors=True):
    print(record.vectors)
Check out the rg.Record class reference for more information on the properties and methods available on a record and the rg.Query class reference for more information on the query syntax.
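The batched fetching mentioned in this section can be pictured with a generator that requests one page at a time and yields records individually. This is a rough sketch under stated assumptions: iter_in_batches and fetch_page are hypothetical names, the list fake_server stands in for the Argilla server, and the parameter names merely mirror batch_size, start_offset and limit from the reference below.

```python
from typing import Callable, Iterator, List, Optional


def iter_in_batches(
    fetch_page: Callable[[int, int], List[dict]],
    batch_size: int = 256,
    start_offset: int = 0,
    limit: Optional[int] = None,
) -> Iterator[dict]:
    """Yield records one by one while requesting them page by page."""
    offset = start_offset
    yielded = 0
    while limit is None or yielded < limit:
        size = batch_size if limit is None else min(batch_size, limit - yielded)
        page = fetch_page(offset, size)
        if not page:
            return  # the server has no more records
        for record in page:
            yield record
        yielded += len(page)
        offset += len(page)


# Stand-in for the server: ten records held in a list.
fake_server = [{"id": str(i)} for i in range(10)]
calls = []


def fetch_page(offset: int, size: int) -> List[dict]:
    calls.append((offset, size))  # remember each request for inspection
    return fake_server[offset : offset + size]


ids = [r["id"] for r in iter_in_batches(fetch_page, batch_size=4, start_offset=2, limit=7)]
print(ids, calls)
```

Because the generator only asks for the next page when the caller needs it, no full local copy of the records is held at any point.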
DatasetRecords
¶
Bases: Iterable[Record], LoggingMixin
This class is used to work with records from a dataset and is accessed via Dataset.records.
The responsibility of this class is to provide an interface to interact with records in a dataset, by adding, updating, fetching, querying, deleting, and exporting records.
Attributes:
Name | Type | Description |
---|---|---|
client | Argilla | The Argilla client object. |
dataset | Dataset | The dataset object. |
Source code in src/argilla/records/_dataset_records.py
__init__(client, dataset, mapping=None)
¶
Initializes a DatasetRecords object with a client and a dataset.
Args:
- client: An Argilla client object.
- dataset: A Dataset object.
__call__(query=None, batch_size=DEFAULT_BATCH_SIZE, start_offset=0, with_suggestions=True, with_responses=True, with_vectors=None, limit=None)
¶
Returns an iterator over the records in the dataset on the server.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | Optional[Union[str, Query]] | A string or a Query object to filter the records. | None |
batch_size | Optional[int] | The number of records to fetch in each batch. The default is 256. | DEFAULT_BATCH_SIZE |
start_offset | int | The offset from which to start fetching records. The default is 0. | 0 |
with_suggestions | bool | Whether to include suggestions in the records. The default is True. | True |
with_responses | bool | Whether to include responses in the records. The default is True. | True |
with_vectors | Optional[Union[List, bool, str]] | A list of vector names to include in the records. The default is None. If a list is provided, only the specified vectors will be included. If True is provided, all vectors will be included. | None |
limit | Optional[int] | The maximum number of records to fetch. The default is None. | None |
Returns:
Type | Description |
---|---|
DatasetRecordsIterator | An iterator over the records in the dataset on the server. |
log(records, mapping=None, user_id=None, batch_size=DEFAULT_BATCH_SIZE, on_error=RecordErrorHandling.RAISE)
¶
Add or update records in a dataset on the server using the provided records.
If the record includes a known id field, the record will be updated. If the record does not include a known id field, the record will be added as a new record.
See rg.Record for more information on the record definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
records | Union[List[dict], List[Record], HFDataset] | A list of | required |
mapping | Optional[Dict[str, Union[str, Sequence[str]]]] | A dictionary that maps the keys/column names in the records to the fields or questions in the Argilla dataset. To assign an incoming key or column to multiple fields or questions, provide a list or tuple of field or question names. | None |
user_id | Optional[UUID] | The user id to be associated with the records' response. If not provided, the current user id is used. | None |
batch_size | int | The number of records to send in each batch. The default is 256. | DEFAULT_BATCH_SIZE |
Returns:
Type | Description |
---|---|
DatasetRecords | A list of Record objects representing the updated records. |
delete(records, batch_size=DEFAULT_DELETE_BATCH_SIZE)
¶
Delete records in a dataset on the server using the provided records and matching based on the id.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
records | List[Record] | A list of | required |
batch_size | int | The number of records to send in each batch. The default is 64. | DEFAULT_DELETE_BATCH_SIZE |
Returns:
Type | Description |
---|---|
List[Record] | A list of Record objects representing the deleted records. |
to_dict(flatten=False, orient='names')
¶
Return the records as a dictionary. This is a convenient shortcut for dataset.records(...).to_dict().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
flatten | bool | The structure of the exported dictionary. - True: The record fields, metadata, suggestions and responses will be flattened. - False: The record fields, metadata, suggestions and responses will be nested. | False |
orient | str | The orientation of the exported dictionary. - "names": The keys of the dictionary will be the names of the fields, metadata, suggestions and responses. - "index": The keys of the dictionary will be the id of the records. | 'names' |
Returns: A dictionary of records.
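The difference between the two orient values can be illustrated with plain dictionaries. The helper records_to_dict below is hypothetical and is not Argilla's implementation; in particular, representing each name as a list of per-record values under "names" is an assumption, since the description above only states what the keys are.

```python
from typing import Any, Dict, List


def records_to_dict(rows: List[Dict[str, Any]], orient: str = "names") -> Dict[str, Any]:
    """Pivot a list of record dictionaries (hypothetical helper, not Argilla's code)."""
    if orient == "names":
        # One key per attribute name; each value lists that attribute across records.
        names = list(dict.fromkeys(key for row in rows for key in row))
        return {name: [row.get(name) for row in rows] for name in names}
    if orient == "index":
        # One key per record id; each value is the full record.
        return {row["id"]: row for row in rows}
    raise ValueError(f"unsupported orient: {orient!r}")


rows = [
    {"id": "1", "question": "Do you need oxygen to breathe?", "answer": "Yes"},
    {"id": "2", "question": "What is the boiling point of water?", "answer": "100 degrees Celsius"},
]
by_name = records_to_dict(rows, orient="names")
by_index = records_to_dict(rows, orient="index")
print(list(by_name), list(by_index))
```

A name-keyed layout is convenient for column-wise tools, while an id-keyed layout makes it easy to look up a single record.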
to_list(flatten=False)
¶
Return the records as a list of dictionaries. This is a convenient shortcut for dataset.records(...).to_list().
Parameters:
Name | Type | Description | Default |
---|---|---|---|
flatten | bool | The structure of the exported dictionaries in the list. - True: The record keys are flattened and a dot notation is used to record attributes and their attributes. For example, | False |
Returns: A list of dictionaries of records.
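As a generic illustration of the dot notation mentioned for flatten, the sketch below collapses nested dictionaries into dot-separated keys. The function flatten and the sample record are hypothetical, and the exact key names Argilla produces are not given in the text above, so the output here should not be read as Argilla's actual export format.

```python
from typing import Any, Dict


def flatten(nested: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Flatten nested dictionaries into dot-separated keys (generic illustration)."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that inner keys are prefixed with the outer key.
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


record = {
    "id": "2",
    "fields": {"question": "Do you need oxygen to breathe?", "answer": "Yes"},
    "metadata": {"department": "toys"},
}
flat_record = flatten(record)
print(flat_record)
```

A flat structure of this kind maps directly onto table columns, which is the usual reason to prefer it over the nested form.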
to_json(path)
¶
Export the records to a file on disk.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the file to save the records. | required |
Returns:
Type | Description |
---|---|
Path | The path to the file where the records were saved. |
from_json(path)
¶
Creates a DatasetRecords object from a disk path to a JSON file. The JSON file should be defined by DatasetRecords.to_json.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | The path to the file containing the records. | required |
Returns:
Name | Type | Description |
---|---|---|
DatasetRecords | List[Record] | The DatasetRecords object created from the disk path. |
to_datasets()
¶
Export the records to a HFDataset.
Returns:
Type | Description |
---|---|
HFDataset | The dataset containing the records. |