๐Ÿ” Backup and version Argilla Datasets using DVC#

In this tutorial, we will show you how to store and version your data using DVC. Alternatively, you can take a look at our Elasticsearch docs about creating retention snapshots directly from your Elasticsearch cluster. The tutorial walks you through the following steps:

  • ⚙️ configure DVC

  • 🧐 determine backup config

  • 🧪 test backup config


Introduction#

It is important to keep track of and store your data, both to version the data used in training cycles and to avoid losing it. DVC creates a reference to your data and stores the data itself in an external storage location. Pushing this reference to git allows us to reproduce certain stages of your repository, while also keeping a copy of the exact data that was in the repo at that exact time. Think "git for data".

Take a look at the DVC docs to get a bit more familiar with the idea behind this versioning principle.

Let's get started!

Setup#

Apart from Argilla, we'll need to install DVC.

[ ]:
!brew install dvc # mac
# !snap install --classic dvc # linux
# !choco install dvc # windows
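
If you prefer installing DVC from PyPI instead of a system package manager, pip should also work; when installed this way, the Google Drive remote we configure later requires the gdrive extra.

[ ]:
!pip install "dvc[gdrive]"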

Configure .git#

We will use GitHub to track our stored files. This requires us to link our directory to a git remote. We assume that the environment already has the correct git credentials set up and that the directory is an initialized git repository linked to a remote. This can be tested with git remote -v.

[6]:
!git remote -v
origin  https://github.com/argilla-io/argilla.git (fetch)
origin  https://github.com/argilla-io/argilla.git (push)

Configure DVC#

We will first initialize our DVC repo, which will automatically be linked to our git remote.

[8]:
!dvc init
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

Next, we assume that DVC will be used in combination with Google Drive as remote storage. Other options are available, but configuring Google Drive is the most straightforward approach. It is configured by adding a remote as shown below, where <your-gdrive-folder-id> is replaced with the ID of the Google Drive folder you would like to use for storage. Alternatively, you can go to their configuration page.

[ ]:
!dvc remote add myremote gdrive://<your-gdrive-folder-id>
[ ]:
!dvc remote default myremote
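
To double-check the setup, you can list the remotes that DVC knows about.

[ ]:
!dvc remote list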

Additionally, we enable autostaging for DVC, which automatically stages newly created .dvc files with git so that they only need to be committed and pushed.

[13]:
!dvc config core.autostage true

Define Background Process#

After setting up DVC, we can now define a function that collects and stores data. It follows these steps:

  • Export data using the naming convention /data/YY-mm-dd_dataset

  • (optional) create /data_descriptions to add to GitHub

  • Add the data to DVC, creating a .dvc reference to /data/*

  • Commit the .dvc reference to GitHub

  • Push /data/* to DVC remote storage and push the .dvc files to GitHub

This kind of versioning allows us to explore data in GitHub by first using git checkout (to switch a branch or check out a .dvc file version) and then running dvc checkout to sync the data.

[22]:
import glob
import os
import time
from typing import List

import argilla as rg

rg.init(api_url=os.environ.get("ARGILLA_API_URL_DEV"), api_key=os.environ.get("ARGILLA_API_KEY"))

def dataset_backupper(datasets: List[str], duration: int = 60 * 60 * 24):
    while True:
        # load datasets and save as .pkl files
        for dataset_name in datasets:
            ds = rg.load(dataset_name)
            df = ds.to_pandas()
            df.to_pickle(f"data/{dataset_name}.pkl")

        # track all .pkl files with DVC; autostage adds the .dvc references to git
        files = glob.glob("data/*.pkl", recursive=True)
        for file in files:
            os.system(f"dvc add {file}")

        # push the data to remote storage and the .dvc references to GitHub
        os.system("dvc push")
        os.system("git commit -m 'update DVC files'")
        os.system("git push")

        time.sleep(duration)
dataset_backupper(["argilla-dvc"])
Everything is up to date.
[WARNING] Unstaged files detected.
[INFO] Stashing unstaged files to /Users/davidberenstein/.cache/pre-commit/patch1673435632-44248.
check yaml...........................................(no files to check)Skipped
fix end of files.....................................(no files to check)Skipped
trim trailing whitespace.................................................Passed
Insert license header in Python source files.........(no files to check)Skipped
black................................................(no files to check)Skipped
isort................................................(no files to check)Skipped
[INFO] Restored changes from /Users/davidberenstein/.cache/pre-commit/patch1673435632-44248.
[docs/tutorial-on-dvc-usage c86cb402] update DVC files
 4 files changed, 10 insertions(+)
 create mode 100644 docs/_source/tutorials/notebooks/.dvc/.gitignore
 create mode 100644 docs/_source/tutorials/notebooks/.dvc/config
 create mode 100644 docs/_source/tutorials/notebooks/.dvcignore
 create mode 100644 docs/_source/tutorials/notebooks/data/zingg.pkl.dvc
To https://github.com/argilla-io/argilla.git
   2ea0912d..c86cb402  docs/tutorial-on-dvc-usage -> docs/tutorial-on-dvc-usage
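
Note that dataset_backupper blocks the process it runs in. If you want it to behave more like an actual background process alongside other work in the same notebook or script, one option is to run it in a daemon thread; this is just a minimal sketch under that assumption, not part of the original backup code.

[ ]:
import threading

# run the backup loop in the background of the current Python process
backup_thread = threading.Thread(
    target=dataset_backupper,
    args=(["argilla-dvc"],),
    daemon=True,  # the thread will not keep the process alive on exit
)
backup_thread.start()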

This is just a toy example, but it is highly configurable depending on your needs. Think about:

  • only backing up records that are more than X days old (see the sketch below)

  • deleting records after backing them up

  • separating backups per time period

  • adding model versioning into the mix
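
As a sketch of the first idea, you could filter the exported records on their timestamps before pickling them. This is only an illustration: it assumes your records expose an event_timestamp field (the column name may differ depending on your Argilla version and task), and backup_old_records is a hypothetical helper, not part of the tutorial.

[ ]:
import pandas as pd

def backup_old_records(dataset_name: str, days: int = 7):
    # load the dataset and keep only records older than `days` days
    df = rg.load(dataset_name).to_pandas()
    cutoff = pd.Timestamp.now(tz="UTC") - pd.Timedelta(days=days)
    old_records = df[pd.to_datetime(df["event_timestamp"], utc=True) < cutoff]
    old_records.to_pickle(f"data/{dataset_name}_older_than_{days}_days.pkl")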

Be creative and have some fun while doing it 🤓

Retrieve data versions#

We can now explore data based on our git commit hashes. git checkout <commit> restores a previous commit, along with the corresponding *.dvc references. Then, dvc pull fetches and checks out the data files that were present at that specific <commit>.
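
For example, a minimal sketch of that round trip might look as follows, where <commit> and <dataset_name> are placeholders you replace with a commit hash from your own history and the name of your backed-up dataset.

[ ]:
!git log --oneline -- data/                            # find the commit that tracked the data version you need
!git checkout <commit> -- data/<dataset_name>.pkl.dvc  # restore the .dvc reference from that commit
!dvc pull                                              # fetch the matching data files from remote storage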

Summary#

In this tutorial, we learned a bit about DVC and how this cool package can be used to back up and version data within the Argilla ecosystem. This can help to preserve data and keep a clean overview of your data and model history.

Next steps#

โญ Argilla Github repo to stay updated.

📚 Argilla documentation for more guides and tutorials.

🙋‍♀️ Join the Argilla community! A good place to start is the discussion forum.