🕵🏻♀️ Create a RAG system expert in a GitHub repository and log your predictions in Argilla¶
In this tutorial, we'll show you how to create a RAG system that can answer questions about a specific GitHub repository. As an example, we will target the Argilla repository. The RAG system will focus on the docs of the repository, as that's where most of the natural language information about the repository can be found.
This tutorial includes the following steps:
- Setting up the Argilla handler for LlamaIndex.
- Initializing a GitHub client.
- Creating an index with a specific set of files from the GitHub repository of our choice.
- Creating a RAG system out of the Argilla repository, asking questions, and automatically logging the answers to Argilla.
This tutorial is based on the GitHub Repository Reader made by LlamaIndex.
Getting started¶
Deploy the Argilla server¶
If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy it following this guide.
Let's make the required imports:
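A sketch of the imports used throughout this tutorial, assuming the `llama-index`, `llama-index-readers-github`, and `argilla-llama-index` packages are installed (exact module paths may vary across versions):

```python
import os
from getpass import getpass

import nest_asyncio
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.instrumentation import get_dispatcher
from llama_index.llms.openai import OpenAI
from llama_index.readers.github import GithubClient, GithubRepositoryReader
from argilla_llama_index import ArgillaHandler
```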
We need to set the OpenAI API key and the GitHub token. The OpenAI API key is required to run queries using GPT models, while the GitHub token ensures you have access to the repository you're using. Although the GitHub token might not be necessary for public repositories, it is still recommended.
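One way to provide both credentials is to read them from the environment and prompt interactively only when they are missing (the variable names here are a common convention, not mandated by the libraries):

```python
import os
from getpass import getpass

# Read credentials from the environment, prompting as a fallback.
openai_api_key = os.getenv("OPENAI_API_KEY") or getpass("Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

github_token = os.getenv("GITHUB_TOKEN") or getpass("Enter your GitHub token: ")
os.environ["GITHUB_TOKEN"] = github_token
```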
Set Argilla's LlamaIndex handler¶
To easily log your data into Argilla within your LlamaIndex workflow, you only need to initialize the Argilla handler and attach it to the LlamaIndex dispatcher for spans and events. This ensures that the predictions obtained using LlamaIndex are automatically logged to the Argilla instance, along with useful metadata. The handler takes the following arguments; a setup sketch follows the note below.
- `dataset_name`: The name of the dataset. If the dataset does not exist, it will be created with the specified name. Otherwise, it will be updated.
- `api_url`: The URL to connect to the Argilla instance.
- `api_key`: The API key to authenticate with the Argilla instance.
- `number_of_retrievals`: The number of retrieved documents to be logged. Defaults to 0.
- `workspace_name`: The name of the workspace to log the data. By default, the first available workspace.
> For more information about the credentials, check the documentation for users and workspaces.
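A minimal sketch of the setup, assuming the `argilla-llama-index` integration exposes an `ArgillaHandler` class with the arguments listed above; the dataset name, `api_url`, and `api_key` values are placeholders for your own instance:

```python
from llama_index.core.instrumentation import get_dispatcher
from argilla_llama_index import ArgillaHandler

argilla_handler = ArgillaHandler(
    dataset_name="github_query_llama_index",  # created if it does not exist yet
    api_url="http://localhost:6900",          # placeholder: your Argilla instance URL
    api_key="argilla.apikey",                 # placeholder: your Argilla API key
    number_of_retrievals=2,                   # log the two retrieved documents per query
)

# Attach the handler to the root dispatcher so spans and events are logged.
root_dispatcher = get_dispatcher()
root_dispatcher.add_span_handler(argilla_handler)
root_dispatcher.add_event_handler(argilla_handler)
```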
Retrieve the data from GitHub¶
First, we need to initialize the GitHub client, which will include the GitHub token for repository access.
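For example, assuming the token was stored in the environment as above:

```python
import os

from llama_index.readers.github import GithubClient

# Initialize the GitHub client with the token for repository access.
github_client = GithubClient(github_token=os.environ["GITHUB_TOKEN"], verbose=True)
```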
Before creating our `GithubRepositoryReader` instance, we need to adjust the nesting. Since the Jupyter kernel operates on an event loop, we must prevent this loop from finishing before the repository is fully read.
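This is typically done with the `nest_asyncio` package:

```python
import nest_asyncio

# Allow nested event loops so the reader can run inside the Jupyter kernel's loop.
nest_asyncio.apply()
```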
Now, let's create a `GithubRepositoryReader` instance with the necessary repository details. In this case, we'll target the `main` branch of the `argilla` repository. As we will focus on the documentation, we will restrict the reader to the `argilla/docs/` folder, excluding images, JSON files, and `.ipynb` files.
```python
documents = GithubRepositoryReader(
    github_client=github_client,
    owner="argilla-io",
    repo="argilla",
    use_parser=False,
    verbose=False,
    filter_directories=(
        ["argilla/docs/"],
        GithubRepositoryReader.FilterType.INCLUDE,
    ),
    filter_file_extensions=(
        [
            ".png",
            ".jpg",
            ".jpeg",
            ".gif",
            ".svg",
            ".ico",
            ".json",
            ".ipynb",  # Erase this line if you want to include notebooks
        ],
        GithubRepositoryReader.FilterType.EXCLUDE,
    ),
).load_data(branch="main")
```
Create the index and make some queries¶
Now, let's create a LlamaIndex index out of these documents, and we can start querying the RAG system.
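A sketch of the index and query setup; the OpenAI model name and temperature are assumptions, and any LlamaIndex-supported LLM would work in their place:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.llms.openai import OpenAI

# Use an OpenAI chat model for generation (assumed model; swap in your preferred LLM).
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.8)

# Build a vector index over the loaded documents and expose it as a query engine.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("How do I create a Dataset in Argilla?")
print(response)
```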
The generated response will be automatically logged in our Argilla instance. Check it out! From Argilla you can quickly have a look at your predictions and annotate them, so you can combine both synthetic data and human feedback.
Let's ask a couple more questions to see the overall behavior of the RAG chatbot. Remember that the answers are automatically logged into your Argilla instance.
```python
questions = [
    "How can I list the available datasets?",
    "Which are the user credentials?",
    "Can I use markdown in Argilla?",
    "Could you explain how to annotate datasets in Argilla?",
]

answers = []

for question in questions:
    answers.append(query_engine.query(question))

for question, answer in zip(questions, answers):
    print(f"Question: {question}")
    print(f"Answer: {answer}")
    print("----------------------------")
```