🕵🏻‍♀️ Create a RAG system expert in a GitHub repository and log your predictions in Argilla

In this tutorial, we'll show you how to create a RAG system that can answer questions about a specific GitHub repository. As an example, we will use the Argilla repository. The RAG system will be built on the repository's docs, since that's where most of the natural language information about the repository can be found.

This tutorial includes the following steps:

Setting up the Argilla handler for LlamaIndex.
Initializing a GitHub client.
Creating an index with a specific set of files from the GitHub repository of our choice.
Creating a RAG system from the Argilla repository, asking questions, and automatically logging the answers to Argilla.

This tutorial is based on the GitHub Repository Reader made by LlamaIndex.

Getting started

Deploy the Argilla server

If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following this guide.

Set up the environment

To complete this tutorial, you need to install this integration and a third-party library via pip.

Note

Check the integration GitHub repository here.

!pip install "argilla-llama-index"
!pip install "llama-index-readers-github==0.1.9"

Let's make the required imports:

from llama_index.core import (
    Settings,
    VectorStoreIndex,
)
from llama_index.core.instrumentation import get_dispatcher
from llama_index.llms.openai import OpenAI
from llama_index.readers.github import (
    GithubClient,
    GithubRepositoryReader,
)

from argilla_llama_index import ArgillaHandler

We need to set the OpenAI API key and the GitHub token. The OpenAI API key is required to run queries using GPT models, while the GitHub token ensures you have access to the repository you're using. Although the GitHub token might not be necessary for public repositories, it is still recommended.

import os

os.environ["OPENAI_API_KEY"] = "sk-..."
openai_api_key = os.getenv("OPENAI_API_KEY")

os.environ["GITHUB_TOKEN"] = "ghp_..."
github_token = os.getenv("GITHUB_TOKEN")
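Hardcoding secrets in a notebook makes them easy to leak. As a minimal sketch, assuming both keys were already exported in the shell that started Jupyter, a small helper can read them and fail early with a clear message. The name `require_env` is hypothetical and not part of any library used here.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage, assuming both variables were exported before starting Jupyter:
# openai_api_key = require_env("OPENAI_API_KEY")
# github_token = require_env("GITHUB_TOKEN")
```

Failing at this point is usually easier to debug than an authentication error raised later, in the middle of a query.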

Set up Argilla's LlamaIndex handler

To easily log your data into Argilla within your LlamaIndex workflow, you only need to initialize the Argilla handler and attach it to the LlamaIndex dispatcher for spans and events. This ensures that the predictions obtained using LlamaIndex are automatically logged to the Argilla instance, along with useful metadata.

dataset_name: The name of the dataset. If the dataset does not exist, it will be created with the specified name. Otherwise, it will be updated.
api_url: The URL to connect to the Argilla instance.
api_key: The API key to authenticate with the Argilla instance.
number_of_retrievals: The number of retrieved documents to be logged. Defaults to 0.
workspace_name: The name of the workspace where the data is logged. Defaults to the first available workspace.

> For more information about the credentials, check the documentation for users and workspaces.

argilla_handler = ArgillaHandler(
    dataset_name="github_query_llama_index",
    api_url="http://localhost:6900",
    api_key="argilla.apikey",
    number_of_retrievals=2,
)
root_dispatcher = get_dispatcher()
root_dispatcher.add_span_handler(argilla_handler)
root_dispatcher.add_event_handler(argilla_handler)
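If you would rather not hardcode the connection details, the handler arguments can be assembled from the environment first. This is only a sketch: the helper name `argilla_handler_kwargs` and the variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are assumptions made for illustration, and the fallbacks simply mirror the values used in this tutorial.

```python
import os


def argilla_handler_kwargs(dataset_name: str) -> dict:
    """Build keyword arguments for ArgillaHandler from environment variables.

    ARGILLA_API_URL and ARGILLA_API_KEY are assumed variable names; the
    fallback values are the ones used in this tutorial.
    """
    return {
        "dataset_name": dataset_name,
        "api_url": os.environ.get("ARGILLA_API_URL", "http://localhost:6900"),
        "api_key": os.environ.get("ARGILLA_API_KEY", "argilla.apikey"),
        "number_of_retrievals": 2,
    }


kwargs = argilla_handler_kwargs("github_query_llama_index")
# The dictionary can then be unpacked into the handler:
# argilla_handler = ArgillaHandler(**kwargs)
```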

Retrieve the data from GitHub

First, we need to initialize the GitHub client, which will include the GitHub token for repository access.

github_client = GithubClient(github_token=github_token, verbose=True)

Before creating our GithubRepositoryReader instance, we need to allow nested event loops. The Jupyter kernel already runs an asyncio event loop, and the reader runs asynchronous code internally, so without this patch reading the repository from a notebook can fail because the loop is already running.

import nest_asyncio

nest_asyncio.apply()
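To see the problem that `nest_asyncio` works around, the following standard-library sketch reproduces it outside Jupyter: it tries to drive a coroutine with `run_until_complete` from inside a loop that is already running. The coroutine names `fetch` and `outer` are made up for the illustration.

```python
import asyncio


async def fetch() -> str:
    """Stand-in for an asynchronous call such as reading a file from GitHub."""
    return "done"


async def outer() -> str:
    # We are now inside a running event loop, as in a Jupyter cell.
    loop = asyncio.get_running_loop()
    coro = fetch()
    try:
        # Re-entering the running loop is rejected by plain asyncio.
        loop.run_until_complete(coro)
    except RuntimeError as err:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(err)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Plain asyncio refuses the nested call with a `RuntimeError`. `nest_asyncio.apply()` patches asyncio so that this kind of re-entrant call is allowed.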

Now, let’s create a GithubRepositoryReader instance with the necessary repository details. In this case, we'll target the main branch of the argilla repository. Since we are interested in the documentation, we will include only the argilla/docs/ folder and exclude images, JSON files, and ipynb files.

documents = GithubRepositoryReader(
    github_client=github_client,
    owner="argilla-io",
    repo="argilla",
    use_parser=False,
    verbose=False,
    filter_directories=(
        ["argilla/docs/"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [
            ".png",
            ".jpg",
            ".jpeg",
            ".gif",
            ".svg",
            ".ico",
            ".json",
            ".ipynb",  # Remove this line if you want to include notebooks
        ],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch="main")
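The combined effect of the two filters can be pictured with a small predicate. This is an illustrative sketch only, assuming simple prefix matching on directories and suffix matching on extensions. It is not the reader's actual implementation, and the helper name `keep_path` and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def keep_path(path: str, include_dirs: list[str], exclude_exts: list[str]) -> bool:
    """Keep a file only if it is under an included directory and its extension is not excluded."""
    in_included_dir = any(path.startswith(directory) for directory in include_dirs)
    has_excluded_ext = PurePosixPath(path).suffix.lower() in exclude_exts
    return in_included_dir and not has_excluded_ext


include_dirs = ["argilla/docs/"]
exclude_exts = [".png", ".json", ".ipynb"]

# Hypothetical paths, used only to exercise the predicate
for path in [
    "argilla/docs/index.md",
    "argilla/docs/assets/logo.png",
    "argilla/src/client.py",
]:
    print(path, keep_path(path, include_dirs, exclude_exts))
```

Only the Markdown file under the docs folder passes both checks. The image is rejected by its extension and the source file by its directory.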

Create the index and make some queries

Now, let's create a LlamaIndex index from these documents so that we can start querying the RAG system.

# LLM settings
Settings.llm = OpenAI(
    model="gpt-3.5-turbo", temperature=0.8, api_key=openai_api_key
)

# Load the data and create the index
index = VectorStoreIndex.from_documents(documents)

# Create the query engine
query_engine = index.as_query_engine()

response = query_engine.query("How do I create a Dataset in Argilla?")
response
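Conceptually, the query engine first retrieves the document chunks most similar to the question and then passes them to the LLM as context. The toy sketch below illustrates only the retrieval step, using bag-of-words vectors and cosine similarity in place of a learned embedding model. It is not how LlamaIndex is implemented, and the function names and sample chunks are invented for the example.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (real systems use a learned model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vector = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vector, embed(c)), reverse=True)
    return ranked[:k]


# Invented sample chunks standing in for pieces of the documentation
chunks = [
    "how to create a dataset",
    "managing users and workspaces",
    "deploying the server with docker",
]
top = retrieve("create a dataset", chunks, k=1)
print(top)
```

In the real pipeline, the retrieved chunks are inserted into the prompt. The `number_of_retrievals` argument of the handler controls how many of them are logged to Argilla.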

The generated response will be automatically logged in our Argilla instance. Check it out! From Argilla you can quickly have a look at your predictions and annotate them, so you can combine both synthetic data and human feedback.

[Image: Argilla UI]

Let's ask a few more questions to see the overall behavior of the RAG chatbot. Remember that the answers are automatically logged into your Argilla instance.

questions = [
    "How can I list the available datasets?",
    "Which are the user credentials?",
    "Can I use markdown in Argilla?",
    "Could you explain how to annotate datasets in Argilla?",
]

answers = []

for question in questions:
    answers.append(query_engine.query(question))

for question, answer in zip(questions, answers):
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print("----------------------------")

Question: How can I list the available datasets?
Answer: You can list all the datasets available in a workspace by utilizing the `datasets` attribute of the `Workspace` class. Additionally, you can determine the number of datasets in a workspace by using `len(workspace.datasets)`. To list the datasets, you can iterate over them and print out each dataset. Remember that dataset settings are not preloaded when listing datasets, and if you need to work with settings, you must load them explicitly for each dataset.
----------------------------
Question: Which are the user credentials?
Answer: The user credentials in Argilla consist of a username, password, and API key.
----------------------------
Question: Can I use markdown in Argilla?
Answer: Yes, you can use Markdown in Argilla.
----------------------------
Question: Could you explain how to annotate datasets in Argilla?
Answer: To annotate datasets in Argilla, users can manage their data annotation projects by setting up `Users`, `Workspaces`, `Datasets`, and `Records`. By deploying Argilla on the Hugging Face Hub or with `Docker`, installing the Python SDK with `pip`, and creating the first project, users can get started in just 5 minutes. The tool allows for interacting with data in a more engaging way through features like quick labeling with filters, AI feedback suggestions, and semantic search, enabling users to focus on training models and monitoring their performance effectively.
----------------------------