Open In Colab  View Notebook on GitHub

Backup and version Argilla Datasets using DVC#

In this tutorial, we will show you how to store and version your data using DVC. Alternatively, you can take a look at our Elasticsearch docs about creating retention snapshots directly from your Elasticsearch cluster. The tutorial walks you through the following steps:

  • ⚙️ configure DVC

  • 🧐 determine backup config

  • 🧪 test backup config



It is important to keep track of and store your data, both to version the data used in training cycles and to avoid losing it. DVC creates a reference to your data and stores the data itself in external storage. Pushing this reference to git allows us to reproduce a given state of the repository while also keeping a copy of the exact data that was in the repo at that time. Think “git for data”.
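To make the idea of a "reference" more concrete, here is a minimal sketch of a content-addressed pointer: hash the data file and keep the hash, size, and file name in a small record that is cheap to commit. The helper name `make_pointer` is hypothetical, and the record only approximates what a real `.dvc` file holds; DVC's actual file format and hashing details may differ.

```python
import hashlib
import os
import tempfile


def make_pointer(path: str) -> dict:
    """Describe a data file by its content hash.

    Hypothetical helper for illustration only; it is not part of DVC.
    """
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }


# Usage: the pointer changes only when the file content changes.
with tempfile.TemporaryDirectory() as tmp:
    data_file = os.path.join(tmp, "dataset.pkl")

    with open(data_file, "wb") as handle:
        handle.write(b"version 1")
    first = make_pointer(data_file)

    with open(data_file, "wb") as handle:
        handle.write(b"version 2")
    second = make_pointer(data_file)

print(first["md5"] != second["md5"])
```

Because the small record is what ends up in git, each commit pins one exact version of the data without storing the data itself in the repository.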

Take a look at the DVC docs to get a bit more familiar with the idea behind this versioning principle.

Let’s get started!

Running Argilla#

For this tutorial, you will need to have an Argilla server running. There are two main options for deploying and running Argilla:

Deploy Argilla on Hugging Face Spaces: If you want to run tutorials with external notebooks (e.g., Google Colab) and you have an account on Hugging Face, you can deploy Argilla on Spaces with a few clicks.

For details about configuring your deployment, check the official Hugging Face Hub guide.

Launch Argilla using Argilla’s quickstart Docker image: This is the recommended option if you want Argilla running on your local machine. Note that this option will only let you run the tutorial locally and not with an external notebook service.

For more information on deployment options, please check the Deployment section of the documentation.


This tutorial is a Jupyter Notebook. There are two options to run it:

  • Use the Open in Colab button at the top of this page. This option allows you to run the notebook directly on Google Colab. Don’t forget to change the runtime type to GPU for faster model training and inference.

  • Download the .ipynb file by clicking on the View source link at the top of the page. This option allows you to download the notebook and run it on your local machine or on a Jupyter notebook tool of your choice.

Setup packages#

To complete this tutorial, you will need to install the Argilla client and DVC.

[ ]:
%pip install argilla -qqq
[ ]:
!brew install dvc # mac
# !snap install --classic dvc # linux
# !choco install dvc # windows

Let’s import the Argilla module for reading and writing data:

[ ]:
import argilla as rg

If you are running Argilla using the Docker quickstart image or public Hugging Face Spaces, you need to init the Argilla client with the URL and API_KEY:

[ ]:
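# Replace api_url with the url to your HF Spaces URL if using Spaces
# Replace api_key if you configured a custom API key
# (the values below assume the quickstart defaults; adjust them to your deployment)
rg.init(api_url="http://localhost:6900", api_key="admin.apikey")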

If you’re running a private Hugging Face Space, you will also need to set the HF_TOKEN as follows:

[ ]:
# # Set the HF_TOKEN environment variable
# import os
# os.environ['HF_TOKEN'] = "your-hf-token"

# # Replace api_url with the url to your HF Spaces URL
# # Replace api_key if you configured a custom API key
# rg.init(
#     api_url="https://[your-owner-name]-[your_space_name]",
#     api_key="admin.apikey",
#     extra_headers={"Authorization": f"Bearer {os.environ['HF_TOKEN']}"},
# )

Finally, let’s include the imports we need:

[ ]:
import datetime
import os
import glob
import time
from typing import List

Enable Telemetry#

We gain valuable insights from how you interact with our tutorials. Running the following lines of code helps us understand whether this tutorial is serving you effectively, so that we can offer you the most suitable content. Although this is entirely anonymous, you can skip this step if you prefer. For more info, please check out the Telemetry page.

[ ]:
try:
    from argilla.utils.telemetry import tutorial_running
    tutorial_running()
except ImportError:
    print("Telemetry is introduced in Argilla 1.20.0 and not found in the current installation. Skipping telemetry.")

Configure .git#

We will use GitHub to track our stored files. This requires linking our directory to a git remote. We assume that the environment already has the correct git credentials set up and that the directory is part of a git repository (it contains a .git directory). You can test this with git remote -v.

[ ]:
!git remote -v
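If you would rather check the remote programmatically than read the output by eye, a small parser is enough. The following is a minimal sketch; the function name `parse_git_remotes` is hypothetical, and the sample output uses a placeholder repository URL.

```python
def parse_git_remotes(output: str) -> dict:
    """Map remote names to their fetch URLs from `git remote -v` output."""
    remotes = {}
    for line in output.splitlines():
        parts = line.split()
        # Each line looks like "<name> <url> (fetch)" or "<name> <url> (push)".
        if len(parts) == 3 and parts[2] == "(fetch)":
            remotes[parts[0]] = parts[1]
    return remotes


# Illustrative output; the repository URL below is a placeholder.
sample = (
    "origin\thttps://github.com/example-user/example-repo.git (fetch)\n"
    "origin\thttps://github.com/example-user/example-repo.git (push)\n"
)
remotes = parse_git_remotes(sample)
if not remotes:
    raise SystemExit("No git remote configured; add one before continuing.")
print(remotes)
```

In the notebook, the `sample` string could be replaced with the real output, for example `subprocess.run(["git", "remote", "-v"], capture_output=True, text=True).stdout`.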

Configure DVC#

We will first initialize our DVC repo, which will automatically be linked to our git remote.

!dvc init
Initialized DVC repository.

You can now commit the changes to git.


Next, we assume that DVC is used with Google Drive as remote storage. Other options are available, but configuring Google Drive is the most straightforward approach. Configure it with commands like the ones below, replacing <your-gdrive-folder-id> with the ID of the Google Drive folder you would like to use for storage. Alternatively, you can consult the DVC remote configuration page.

[ ]:
!dvc remote add myremote gdrive://<your-gdrive-folder-id>
[ ]:
!dvc remote default myremote
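The folder ID is the last part of the link you see when opening the folder in Google Drive. As a minimal sketch, assuming the usual `https://drive.google.com/drive/folders/<id>` link shape, the hypothetical helper below turns such a link into the `gdrive://` URL used above; the folder ID `abc123` is made up.

```python
from urllib.parse import urlparse


def gdrive_remote_url(folder_url: str) -> str:
    """Turn a Google Drive folder link into a `gdrive://` remote URL.

    Hypothetical helper for illustration; it is not part of DVC.
    """
    # Split the path into segments and look for the one after "folders".
    segments = [segment for segment in urlparse(folder_url).path.split("/") if segment]
    if "folders" not in segments:
        raise ValueError(f"Not a Google Drive folder link: {folder_url}")
    index = segments.index("folders")
    if index + 1 >= len(segments):
        raise ValueError(f"No folder ID found in: {folder_url}")
    return f"gdrive://{segments[index + 1]}"


# "abc123" is a made-up folder ID used only for illustration.
print(gdrive_remote_url("https://drive.google.com/drive/folders/abc123?usp=sharing"))
```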

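Additionally, we enable autostaging for DVC, which automatically stages (git add) the files that DVC creates or modifies, such as the .dvc references. Committing them to git remains a separate step.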

!dvc config core.autostage true

Define Background Process#

After setting up DVC, we can now define a function to collect and store data. It follows these steps:

  • Export data using a naming convention /data/YY-mm-dd_dataset

  • (optional) Create /data_descriptions to add to GitHub

  • Add the data to DVC, creating a .dvc reference to the /data/*

  • Commit the .dvc reference to GitHub

  • Push the /data/* to DVC and push the .dvc to GitHub

This kind of versioning allows us to explore data in GitHub by using git checkout first (to switch a branch or checkout a .dvc file version) and then run dvc checkout to sync data.


def dataset_backupper(datasets: List[str], duration: int = 60 * 60 * 24):
    # make sure the output directory exists before writing to it
    os.makedirs("data", exist_ok=True)
    while True:
        # load datasets and save as .pkl files
        for dataset_name in datasets:
            ds = rg.load(dataset_name)
            df = ds.to_pandas()
            df.to_pickle(f"data/{dataset_name}.pkl")

        # get all .pkl files using glob and track them with DVC
        files = glob.glob("data/*.pkl")
        for file in files:
            os.system(f"dvc add {file}")

        # upload the data to the DVC remote, then commit and push
        # the .pkl.dvc references to GitHub (staged by core.autostage)
        os.system("dvc push")
        os.system("git commit -m 'update DVC files'")
        os.system("git push")

        # wait before the next backup (default: one day)
        time.sleep(duration)


dataset_backupper(["argilla-dvc"])
Everything is up to date.
[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to /Users/davidberenstein/.cache/pre-commit/patch1673435632-44248.
check yaml...........................................(no files to check)Skipped
fix end of files.....................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
Insert license header in Python source files.........(no files to check)Skipped
black................................................(no files to check)Skipped
isort................................................(no files to check)Skipped
[INFO] Restored changes from /Users/davidberenstein/.cache/pre-commit/patch1673435632-44248.
[docs/tutorial-on-dvc-usage c86cb402] update DVC files
 4 files changed, 10 insertions(+)
 create mode 100644 docs/_source/tutorials/notebooks/.dvc/.gitignore
 create mode 100644 docs/_source/tutorials/notebooks/.dvc/config
 create mode 100644 docs/_source/tutorials/notebooks/.dvcignore
 create mode 100644 docs/_source/tutorials/notebooks/data/zingg.pkl.dvc
   2ea0912d..c86cb402  docs/tutorial-on-dvc-usage -> docs/tutorial-on-dvc-usage
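Note that os.system does not stop when a command fails, so a failed dvc push would go unnoticed. The following is a minimal sketch of a single backup pass that raises on the first failing command instead. The names `backup_commands` and `run_backup` are hypothetical, and the sketch assumes core.autostage is enabled so that the .dvc references are already staged when git commit runs.

```python
import subprocess
from typing import Callable, List, Sequence


def backup_commands(files: Sequence[str], message: str = "update DVC files") -> List[List[str]]:
    """Build the DVC and git commands for one backup pass."""
    commands = [["dvc", "add", path] for path in files]
    commands.append(["dvc", "push"])
    # Relies on core.autostage having staged the .dvc files during `dvc add`.
    commands.append(["git", "commit", "-m", message])
    commands.append(["git", "push"])
    return commands


def run_backup(files: Sequence[str], runner: Callable[..., object] = subprocess.run) -> None:
    """Run one backup pass, stopping at the first command that fails."""
    for command in backup_commands(files):
        # With subprocess.run, check=True raises CalledProcessError on a non-zero exit code.
        runner(command, check=True)


# Dry run: record the commands with a stand-in runner instead of executing them.
recorded = []
run_backup(["data/argilla-dvc.pkl"], runner=lambda command, check: recorded.append(command))
for command in recorded:
    print(" ".join(command))
```

Keep in mind that git commit exits with a non-zero code when there is nothing to commit, so a loop built on this sketch would need to handle that case, for example by catching subprocess.CalledProcessError around the commit step.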

This is just a toy example, but it is highly configurable depending on your needs. Think about:

  • only backing up records that are more than X days old

  • deleting records after backing them up

  • separating backups per time period

  • adding model versioning into the mix
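As one example of separating backups per time period, the file name can carry the export date, following the YY-mm-dd_dataset naming convention. This is a minimal sketch; the helper name `dated_backup_path` is hypothetical and the date is only illustrative.

```python
import datetime


def dated_backup_path(dataset_name: str, when: datetime.date, root: str = "data") -> str:
    """Build a backup file path of the form <root>/YY-mm-dd_<dataset>.pkl."""
    stamp = when.strftime("%y-%m-%d")
    return f"{root}/{stamp}_{dataset_name}.pkl"


# Illustrative date; in the backup loop you could pass datetime.date.today() instead.
print(dated_backup_path("argilla-dvc", datetime.date(2023, 1, 11)))
```

Since each export then gets its own file, older backups stay available in the working directory as well as in the git history.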

Be creative and have some fun while doing it 🤓

Retrieve data versions#

Next, we can explore data based on our git commit hashes. git checkout <commit> opens a previous commit, along with the corresponding *.dvc references. We can then use dvc pull to fetch and check out the data files that were present at that specific <commit>.
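These two steps can be wrapped in a small helper. The following is a minimal sketch; the function name `restore_version` is hypothetical, and the commit hash in the dry run is the one shown in the backup output earlier in this tutorial.

```python
import subprocess
from typing import Callable


def restore_version(commit: str, runner: Callable[..., object] = subprocess.run) -> None:
    """Check out a commit, then sync the data files it references."""
    # Restore the *.dvc references as they were at the given commit.
    runner(["git", "checkout", commit], check=True)
    # Download the matching data files from the DVC remote and check them out.
    runner(["dvc", "pull"], check=True)


# Dry run: record the commands with a stand-in runner instead of executing them.
recorded = []
restore_version("c86cb402", runner=lambda command, check: recorded.append(command))
for command in recorded:
    print(" ".join(command))
```

Checking out a commit hash leaves the repository in a detached HEAD state, so switch back to your branch afterwards (for example with git checkout -) and run dvc pull again to return to the latest data.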


In this tutorial, we learned a bit about DVC and how this package can be used to back up and version data within the Argilla ecosystem. This helps to preserve data and keep a clean overview of your data and model history.