Collecting RLHF data#

Argilla Feedback assists in three critical stages of the LLM fine-tuning process. The first is the collection of demonstration data for supervised fine-tuning of large language models. While this stage is part of the RLHF process, it can also be used on its own. In supervised fine-tuning, models learn from human-written examples that steer their behavior and improve their capabilities.

The second stage where Argilla Feedback proves beneficial is in the collection of comparison data, a key element for training a reward model for RLHF.

Similarly, Argilla Feedback can be used to write or select prompts for the last stage: reinforcement learning. This collection process is very similar to the first stage, except that labelers are not asked to write demonstrations.

The following figure shows the stages for training and fine-tuning LLMs. From top to bottom, it shows the data needed at each stage (note the color for the data collected with human feedback), the stages themselves (namely, pre-training, supervised fine-tuning, reward modeling, and reinforcement learning), and finally the model created at each stage. Argilla Feedback makes the process of collecting human feedback seamless at each step after the pre-training stage.

LLM fine-tuning stages

Note

This guide uses terminology from the InstructGPT paper and the amazing introduction to RLHF by Chip Huyen. The above figure is an adaptation of Chip Huyen’s figure.

To understand how Argilla Feedback works, let’s deep-dive into the Collecting demonstration data and Collecting comparison data stages.

Collecting demonstration data#

When training large language models, the collection of demonstration data plays a significant role. This data, consisting of prompts and demonstrations, is used in supervised fine-tuning, where models learn to generate responses to prompts based on human-provided examples. Other common names for this stage are instruction-tuning and behavior cloning. Although this is commonly identified as a labor-intensive stage, recent research, like the LIMA work, shows that curating a small set of 1,000 high-quality and diverse examples can effectively teach a model to follow instructions. Argilla Feedback is designed to simplify and distribute this process across many labelers within your organization.

Consider this example. Your company has access to a database of prompts, perhaps from an existing database like ShareGPT or from your internal resources, and you aim to fine-tune a model to respond accurately to these prompts. Using Argilla Feedback, you can efficiently distribute the prompts to human labelers who then write responses as if they were the model. This set of model responses becomes a key component of the supervised fine-tuning process.
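As a concrete starting point, here is a minimal sketch of such a demonstration-collection dataset using the Argilla Python SDK. The connection details, the dataset and workspace names ("demonstration-data", "admin"), and the example prompts are placeholders for this illustration, and the exact arguments may vary slightly across Argilla versions:

```python
import argilla as rg

# Connect to your Argilla instance (placeholder URL and API key).
rg.init(api_url="http://localhost:6900", api_key="<your-api-key>")

# A minimal demonstration-collection dataset: labelers see a prompt and
# write the response the model should have given.
dataset = rg.FeedbackDataset(
    guidelines="Write a helpful, accurate response to the prompt, as if you were the model.",
    fields=[rg.TextField(name="prompt")],
    questions=[
        rg.TextQuestion(
            name="response",
            title="Write the ideal response to this prompt",
            required=True,
        )
    ],
)

# `prompts` stands in for your own prompt source (e.g. a ShareGPT export
# or an internal database); these two strings are just examples.
prompts = ["Write a follow-up for a sales email", "Summarize the attached report"]
dataset.add_records([rg.FeedbackRecord(fields={"prompt": p}) for p in prompts])

# Publishing the dataset makes it available to labelers in the workspace.
dataset.push_to_argilla(name="demonstration-data", workspace="admin")
```

Labelers with access to the workspace can then open the dataset in the Argilla UI and submit their written responses.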

Tip

You can add unlimited users to your Argilla instance and datasets so Argilla can be seamlessly used to distribute the workload among hundreds of labelers or experts within your organization. Similar efforts include Dolly from Databricks or OpenAssistant. If you’d like help setting up such an effort, reach out to us and we’ll gladly help out.

To visualize the desired effect of supervised fine-tuning, let's look at two examples generated by LLMs. First, here is the output of Falcon-7B, a model that has not been instruction-tuned, for the following prompt:

Prompt: Write a follow-up for a sales email. Email:

Completion: I was just curious if you had any questions about the project I discussed with you on the phone yesterday. Please let me know if I can help you with anything else, or any further details that will help you move forward with your project. Thanks again for your time, and have a great weekend, -Sarah

And here is the same model after instruction-tuning, Falcon-7B-instruct:

Prompt: Write a follow-up for a sales email

Completion:

Dear [Customer Name],

Thank you for purchasing [Product Name] from us last week. We hope you have been enjoying using it!

If you have any questions or feedback about your purchase, please do not hesitate to let us know. We are always happy to help.

Best regards,
[Sales Team]

Argilla Feedback can also be used to collect the prompts themselves. In this case, you would ask the labelers not only to provide a response to a given prompt but also to write the prompts. This dual functionality enhances the diversity of your dataset and can lead to more robust model performance.

Additionally, you may choose to gather extra feedback on the prompts. For example, you could ask labelers to rate the clarity or relevance of each prompt or to provide general comments in natural language. This auxiliary information can be invaluable for refining your prompts and guiding the assessment and training processes.
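For instance, the demonstration dataset from the earlier sketch could be extended with auxiliary questions about the prompt. The question names and the 1-5 scale below ("prompt_clarity", "comments") are illustrative choices for this sketch, not anything required by Argilla:

```python
import argilla as rg

# Demonstration dataset extended with auxiliary feedback on the prompt.
dataset = rg.FeedbackDataset(
    guidelines="Write the ideal response, then rate how clear the prompt is.",
    fields=[rg.TextField(name="prompt")],
    questions=[
        rg.TextQuestion(name="response", title="Write the ideal response to this prompt"),
        rg.RatingQuestion(
            name="prompt_clarity",
            title="How clear and relevant is this prompt?",
            values=[1, 2, 3, 4, 5],
        ),
        rg.TextQuestion(
            name="comments",
            title="Any comments about this prompt?",
            required=False,
        ),
    ],
)
```

Records would then be added and pushed exactly as in the previous sketch.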

Tip

Beyond instruction-tuning, curating demonstration data is an important step for aligning the model with certain values and reducing its toxicity. An important related work is “Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets”. In this work, researchers improve language model behavior by fine-tuning on a curated dataset of fewer than 100 examples of prompts and values-aligned responses. If you’d like help setting up such an effort, reach out to us and we’ll gladly help out.

Collecting comparison data#

The key behind the success of ChatGPT by OpenAI and Claude by Anthropic is the application of a third stage, which uses reinforcement learning to steer and align the model with human preferences. The most well-known technique for this stage is called RLHF.

Note

There are other, potentially complementary, approaches like Reinforcement Learning From AI Feedback, but we strongly believe that fine-tuning LLMs with humans in the loop is key to building robust, responsible, and safe models.

RLHF consists of two main steps:

  1. Training a reward model. This model’s purpose is to assign higher scores to responses that humans would prefer.

  2. Optimizing the LLM with RL to generate responses that the reward model rates highly.

The reward model is designed to assign a score to a pair consisting of a prompt and response. However, the process of collecting comparison data operates slightly differently. Usually, comparison data collection entails having humans rank several responses to a specific prompt, listing them from best to worst.
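Concretely, a comparison record typically contains a prompt plus two or more candidate responses, and the annotation is an ordering over those responses. The sketch below shows one way to model this with the Argilla Python SDK, assuming two candidates per prompt. The field and question names ("prompt", "response-1", "response-2", "preference"), the dataset and workspace names, and the `sampled_completions` variable are placeholders, and the RankingQuestion arguments may differ slightly across Argilla versions:

```python
import argilla as rg

# Connect to your Argilla instance (placeholder URL and API key).
rg.init(api_url="http://localhost:6900", api_key="<your-api-key>")

# A dataset where labelers rank two candidate completions for each prompt.
dataset = rg.FeedbackDataset(
    guidelines="Rank the responses from best to worst for the given prompt.",
    fields=[
        rg.TextField(name="prompt"),
        rg.TextField(name="response-1"),
        rg.TextField(name="response-2"),
    ],
    questions=[
        rg.RankingQuestion(
            name="preference",
            title="Order the responses from best to worst",
            values=["response-1", "response-2"],
        )
    ],
)

# `sampled_completions` stands in for your own (prompt, response, response)
# triples, e.g. two generations sampled from the instruction-tuned model.
records = [
    rg.FeedbackRecord(
        fields={"prompt": prompt, "response-1": first, "response-2": second}
    )
    for prompt, first, second in sampled_completions
]
dataset.add_records(records)
dataset.push_to_argilla(name="comparison-data", workspace="admin")
```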

Consider this example. Your company has access to an instruction-following model, after going through the supervised fine-tuning stage or reusing an open-source instruction-following model. After an internal evaluation process, the model shows undesired behaviors like generating made-up facts (sometimes referred to as “hallucinations”), harmful content, or just unhelpful responses. This is where a second stage of alignment with human preferences becomes relevant.

Tip

You can use Argilla Feedback for the internal evaluation process by registering the interactions with the model and asking labelers to rate the quality of the responses. If you’d like help setting up such an effort, reach out to us and we’ll gladly help with the setup.

With Argilla, you can seamlessly create a feedback collection procedure that asks labelers to rank multiple model responses for a specific prompt. The comparison data gathered in this process can be used to train a reward model; a sketch of turning the collected rankings into training pairs follows the list below. This reward model has two key uses:

  1. Evaluating the quality of a prompt-response pair,

  2. Enhancing the model via Reinforcement Learning (RL).
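Once labelers have submitted their rankings, the annotated dataset can be pulled back from Argilla and converted into (prompt, chosen, rejected) triples, the usual input format for reward-model training. The sketch below assumes the dataset, field, and question names from the earlier comparison example; the exact response schema may vary slightly across Argilla versions:

```python
import argilla as rg

# Connect to your Argilla instance (placeholder URL and API key).
rg.init(api_url="http://localhost:6900", api_key="<your-api-key>")

# Pull the annotated comparison dataset back from Argilla
# ("comparison-data" and "admin" are the placeholder names used earlier).
feedback = rg.FeedbackDataset.from_argilla(name="comparison-data", workspace="admin")

preference_pairs = []  # (prompt, chosen, rejected) triples for reward-model training
for record in feedback.records:
    if not record.responses:
        # Skip records that no labeler has annotated yet.
        continue
    # Use the first submitted annotation; in practice you may want to
    # aggregate over several annotators instead.
    ranking = record.responses[0].values["preference"].value
    # For a RankingQuestion, the value is a list of ranked options;
    # the exact attributes may differ slightly between Argilla versions.
    ordered = sorted(ranking, key=lambda item: item.rank)
    chosen, rejected = ordered[0].value, ordered[-1].value
    preference_pairs.append(
        (record.fields["prompt"], record.fields[chosen], record.fields[rejected])
    )
```

These triples can then be fed to the reward-model training pipeline of your choice.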

In the next sections, we discuss how to collect demonstration and comparison data with Argilla.