Generative AI

RLHF and DPO

Prerequisites

Supervised fine-tuning (SFT) as the step before preference tuning
Basic idea of a loss function and gradient-based optimization

ChatGPT's qualitative leap over GPT-3 came less from parameter count or pretraining data than from RLHF. A pretrained model of the same family, fine-tuned on human preference data, became markedly more helpful, harmless, and honest. OpenAI's InstructGPT paper (2022) reported that human labelers preferred outputs from a 1.3B-parameter RLHF-trained model over outputs from the 175B-parameter GPT-3 base model.

Anthropic's Claude uses Constitutional AI (a variant of RLHF): the model first critiques and revises its own outputs against a set of written principles, and an AI model then supplies harmlessness preference labels that human raters would otherwise have to provide. This reduces the volume of human preference data needed while improving safety alignment.
Meta's Llama 2-Chat was trained with RLHF using over 1 million human preference comparisons. The reward models trained on these annotations were then used to score sampled responses, both for rejection sampling (generate several candidates, keep the highest-scoring one for further fine-tuning) and for PPO.
Google's Gemini models are reported to combine RLHF with RLAIF (RL from AI Feedback), in which an AI model evaluates response quality to supplement human raters. Human annotation alone is hard to scale to the volume of comparisons that frontier models consume.

Teaching models what people prefer

In 2017 Paul Christiano and co-authors at OpenAI and DeepMind showed that an agent could learn complex goals from human preference comparisons rather than a hand-written reward, training a reward model from pairwise judgments. In 2022 Long Ouyang and the OpenAI team applied this to language models in InstructGPT: collect human preferences, train a reward model, optimize the policy with PPO. That pipeline became RLHF and powered ChatGPT. RLHF worked but was fragile and expensive, juggling four models at once. In 2023 Rafael Rafailov and colleagues at Stanford published Direct Preference Optimization, proving the same preference objective could be optimized directly with a simple classification-style loss, no separate reward model and no reinforcement learning loop. DPO made preference tuning accessible to teams without an RL infrastructure.

Reward Modeling

A reward model is a neural network that takes a prompt and a response and outputs a scalar score representing human preference quality. It is trained on comparison data: pairs of responses to the same prompt where human annotators indicated which they preferred. The reward model learns the annotation function - "what makes a response good according to humans?" - and can then evaluate millions of responses automatically, at a speed humans cannot match.
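Training on comparison pairs is commonly formalized with a Bradley-Terry pairwise loss. The sketch below assumes that formulation (the text above does not name one) and uses plain floats in place of a network's scalar outputs; `pairwise_loss` and the example scores are illustrative, not taken from any particular library.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry loss: -log P(chosen is preferred over rejected)."""
    return -log_sigmoid(score_chosen - score_rejected)

# The reward model already ranks the pair correctly: small loss.
loss_good = pairwise_loss(2.0, -1.0)
# The reward model ranks the pair the wrong way round: large loss.
loss_bad = pairwise_loss(-1.0, 2.0)
# Equal scores mean the model is undecided: loss equals log(2).
loss_tie = pairwise_loss(0.5, 0.5)
```

Only the difference between the two scores enters the loss, so the absolute scale of the reward is not pinned down by the comparison data.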

Reward model quality is the ceiling for RLHF alignment quality. A reward model that rewards verbose responses trains a verbose policy; one that rewards sycophancy trains a sycophantic policy. This is the "reward hacking" problem: the policy finds outputs that score high on the reward model but are not actually good, because the reward model has a blind spot that the policy learns to exploit.
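A toy illustration of a verbosity blind spot, using invented numbers: the hypothetical `proxy_reward` below tracks true quality but also pays a small bonus per token, and picking the best candidate under that proxy selects the worst answer.

```python
# Hypothetical candidates: (label, true quality in [0, 1], length in tokens).
candidates = [
    ("concise and correct", 0.9, 40),
    ("correct but padded", 0.7, 400),
    ("rambling and wrong", 0.2, 900),
]

def proxy_reward(quality: float, length: int, length_bias: float = 0.002) -> float:
    """A flawed reward model: follows quality but also rewards every extra token."""
    return quality + length_bias * length

# What an optimizer chasing the reward model would pick.
best_by_proxy = max(candidates, key=lambda c: proxy_reward(c[1], c[2]))
# What human raters would actually want.
best_by_quality = max(candidates, key=lambda c: c[1])
```

Here `best_by_proxy` is the rambling answer while `best_by_quality` is the concise one; the gap between the two is what reward hacking exploits.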

What does a reward model output for a given prompt-response pair?

PPO in RLHF

In RLHF, the language model is the policy: it generates responses (actions) given prompts (states) and receives rewards from the reward model. PPO (Proximal Policy Optimization) updates the policy to maximize expected reward while a KL divergence penalty prevents the policy from drifting too far from the supervised fine-tuned (SFT) reference model. Without the KL penalty, the policy collapses to generating reward-hacking outputs that score high but are degenerate.
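The KL-penalized objective can be sketched as a shaped reward: the reward model's score minus `beta` times an estimate of how far the policy's log-probabilities have moved from the reference model's. This is a minimal sequence-level sketch with made-up numbers; real implementations often apply the penalty per token, and the function name and `beta` default are assumptions.

```python
from typing import Sequence

def kl_shaped_reward(
    reward: float,
    logp_policy: Sequence[float],
    logp_ref: Sequence[float],
    beta: float = 0.1,
) -> float:
    """Reward-model score minus beta times a sampled KL estimate.

    logp_policy / logp_ref hold per-token log-probabilities of the same
    generated response under the policy and the frozen reference model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return reward - beta * kl_estimate

# Policy still agrees with the reference: the reward passes through unchanged.
unchanged = kl_shaped_reward(3.0, [-1.0, -2.0], [-1.0, -2.0])

# Policy has become far more confident than the reference on these tokens:
# the same raw reward is discounted by the drift.
penalized = kl_shaped_reward(3.0, [-0.1, -0.1, -0.1], [-2.0, -2.0, -2.0])
```

With `beta = 0`, nothing discourages the drift, which is the regime where the collapse described above shows up.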

PPO-based RLHF typically keeps 4 models in memory simultaneously: the policy (being trained), the reference SFT model (for KL), the reward model (for scoring), and the value model (critic for advantage estimation). Training a 70B-parameter model this way can require GPU clusters on the order of hundreds of A100s once activations, generation, and throughput are accounted for. This computational cost is a primary motivation for DPO.
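A back-of-the-envelope estimate of the model-state memory for the four models. It assumes all four are 70B parameters, 16 bytes per trained parameter (a common rule of thumb for mixed-precision Adam), and 2 bytes per frozen parameter; activations, generation caches, and throughput requirements are ignored, so real clusters are considerably larger.

```python
import math

TRAINED_BYTES = 16  # assumption: fp16 weights + fp16 grads + fp32 master copy, momentum, variance
FROZEN_BYTES = 2    # assumption: fp16 weights only, no gradients or optimizer states

def model_state_gb(params_billion: float, bytes_per_param: int) -> float:
    """Model-state memory in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_param

params_b = 70.0
ppo_budget = {
    "policy": model_state_gb(params_b, TRAINED_BYTES),
    "value model": model_state_gb(params_b, TRAINED_BYTES),
    "reference": model_state_gb(params_b, FROZEN_BYTES),
    "reward model": model_state_gb(params_b, FROZEN_BYTES),
}
ppo_total_gb = sum(ppo_budget.values())

# DPO keeps only the trained policy and the frozen reference.
dpo_total_gb = ppo_budget["policy"] + ppo_budget["reference"]

# Lower bound on 80 GB accelerators needed just to hold the model states.
ppo_min_gpus = math.ceil(ppo_total_gb / 80)
dpo_min_gpus = math.ceil(dpo_total_gb / 80)
```

Under these assumptions the PPO setup needs about twice the model-state memory of a policy-plus-reference setup, before any of the RL-specific generation cost is counted.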

Why is a KL divergence penalty added to the PPO RLHF objective?

Direct Preference Optimization

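DPO (Rafailov et al., 2023) is a supervised alternative to PPO-based RLHF that eliminates the need for a separate reward model and the complex RL training loop. DPO shows that the optimal policy of the KL-constrained RLHF objective has a closed-form relationship to the reward function and the reference (SFT) model, so the reward can be expressed through the policy itself and the policy can be optimized directly on preference pairs with a binary cross-entropy loss. In practice, DPO is usually cheaper and more stable to train than PPO (speedups of 3-5x are commonly quoted) and often produces comparable alignment results, although the comparison depends on the benchmark and setup.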
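The binary cross-entropy loss can be sketched for a single preference pair as follows. Inputs are summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model; the function name, argument names, and `beta` default are illustrative, and a real implementation would work on batched tensors.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """DPO loss for one (chosen, rejected) pair."""
    # Implicit rewards: how much the policy has moved relative to the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# Policy identical to the reference: margin is 0, loss is log(2).
loss_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has raised the chosen response and lowered the rejected one.
loss_improved = dpo_loss(-10.0, -20.0, -12.0, -15.0)
```

The loss has the same form as a pairwise reward-model loss, with the policy-to-reference log-ratio playing the role of the reward score.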

DPO variants have proliferated: IPO (Identity Preference Optimization), KTO (Kahneman-Tversky Optimization, which uses per-response good/bad feedback rather than pairs), and SimPO (Simple Preference Optimization, which drops the reference model). The trl library provides implementations of these. The field is moving rapidly, and many model releases in 2024-2025 use some variant of DPO for preference alignment.
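To make the differences concrete, here is a hedged sketch of the per-pair IPO and SimPO losses as they are usually written in their respective papers. The function names and default hyperparameters are assumptions for illustration, not the trl API; KTO is omitted because its loss needs a batch-level reference point that does not fit a one-pair sketch.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def ipo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """IPO: squared error pulling the log-ratio margin toward 1 / (2 * beta)."""
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return (margin - 1.0 / (2.0 * beta)) ** 2

def simpo_loss(policy_logp_chosen, policy_logp_rejected,
               len_chosen, len_rejected, beta=2.0, gamma=1.0):
    """SimPO: length-normalized log-probs and a target margin, no reference model."""
    margin = beta * (policy_logp_chosen / len_chosen
                     - policy_logp_rejected / len_rejected) - gamma
    return -log_sigmoid(margin)

# IPO is minimized when the margin hits its target (5.0 for beta = 0.1),
# rather than being pushed toward infinity as in DPO.
ipo_at_target = ipo_loss(-10.0, -15.0, -12.5, -12.5)
ipo_overshoot = ipo_loss(-5.0, -25.0, -12.5, -12.5)

# SimPO needs only policy log-probs and response lengths.
simpo_right = simpo_loss(-20.0, -60.0, 20, 20)
simpo_wrong = simpo_loss(-60.0, -20.0, 20, 20)
```

The bounded target in IPO and the length normalization in SimPO are each aimed at a known DPO failure mode: over-optimizing the margin and favoring long responses, respectively.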

What makes DPO simpler to train than PPO-based RLHF?

Preference Optimization in Practice

Preference optimization is not a one-time training step - it is an iterative cycle in production LLM development. The model generates responses, human or AI raters annotate preferences, a new preference dataset is created, DPO or PPO fine-tunes the model, and the cycle repeats. Each iteration requires careful data quality control: annotation disagreements, annotator fatigue, and demographic biases in annotators all shape the final model behavior.
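One simple form of data quality control in this cycle is to collect several votes per comparison and drop pairs the annotators cannot agree on. The sketch below assumes a dictionary of votes per pair and an arbitrary 0.75 agreement threshold; both the data layout and the threshold are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

def majority_label(votes: List[str]) -> Tuple[str, float]:
    """Return the winning label and the fraction of annotators who chose it."""
    label, count = Counter(votes).most_common(1)[0]
    return label, count / len(votes)

def filter_by_agreement(
    annotations: Dict[str, List[str]], min_agreement: float = 0.75
) -> Dict[str, str]:
    """Keep only comparison pairs whose majority label clears the threshold."""
    kept = {}
    for pair_id, votes in annotations.items():
        label, agreement = majority_label(votes)
        if agreement >= min_agreement:
            kept[pair_id] = label
    return kept

# Hypothetical votes: which response ("A" or "B") each annotator preferred.
annotations = {
    "pair-1": ["A", "A", "A", "A"],  # unanimous
    "pair-2": ["A", "B", "A", "B"],  # split, likely ambiguous
    "pair-3": ["B", "B", "B", "A"],  # clear majority
}
clean = filter_by_agreement(annotations)
```

Here `clean` keeps `pair-1` and `pair-3` and drops the split `pair-2`. Filtering this way trades dataset size for label reliability, and it does nothing about biases that all annotators share.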

Constitutional AI (Anthropic) uses an AI model to critique and revise its own outputs against a written constitution. The model is fine-tuned on the revised outputs, and AI-generated preference labels then stand in for much of the human rating of harmlessness. This scales the quality signal: humans write and audit the principles instead of rating every raw output. The constitution is the key innovation - it encodes principles rather than specific preferences.
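The critique-and-revise step can be sketched as a loop over principles. Everything here is hypothetical scaffolding: `model` stands for any text-in, text-out callable, the prompt templates are invented for illustration, and the stub at the end only exists so the loop can run without a real language model.

```python
from typing import Callable, List

def critique_and_revise(
    prompt: str,
    draft: str,
    principles: List[str],
    model: Callable[[str], str],
) -> str:
    """Run one critique pass and one revision pass per principle."""
    response = draft
    for principle in principles:
        critique = model(
            f"Critique the response against this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response

# Stub standing in for a language model: records calls, returns a placeholder.
calls: List[str] = []

def stub_model(text: str) -> str:
    calls.append(text)
    return f"output-{len(calls)}"

revised = critique_and_revise(
    "How do I pick a strong password?",
    "Use your birthday.",
    ["Be helpful", "Avoid unsafe advice"],
    stub_model,
)
```

In the published method the revised responses become supervised fine-tuning data; the loop above covers only the generation of those revisions.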

What is "reward hacking" in the context of RLHF?

Key Ideas

**Reward modeling:** a learned function that scores outputs according to human preferences; trained on comparison pairs (response A vs B, which is better?)
**PPO (Proximal Policy Optimization):** the RL algorithm used in classic RLHF; optimizes the language model to maximize reward model score while staying close to the SFT reference model via KL penalty
**DPO (Direct Preference Optimization):** a simpler alternative to PPO that skips the reward model entirely and fine-tunes directly on preference pairs using a closed-form loss
**Preference optimization:** the broader field of aligning model outputs with human values through comparison data rather than scalar rewards

Questions for reflection

RLHF reward models are trained on comparisons by human annotators who are paid per annotation and often work quickly. What systematic biases does this introduce - and how do they compound over iterative RLHF cycles?
DPO's beta parameter controls the strength of the KL constraint. Higher beta keeps the model closer to the reference SFT model. What happens when beta approaches 0 (no KL constraint) or infinity (model cannot move from reference)?
Constitutional AI uses an AI model to generate self-critiques and preference labels. This creates a feedback loop: the AI doing the evaluating was itself shaped by preference training. How could potential circular biases be identified and corrected?

Related lessons

gai-06 — Fine-tuning is the base RLHF aligns on top of
gai-22 — Alignment from preferences underpins safety work
rl-12 — PPO in RLHF is policy-gradient reinforcement learning
ml-50-policy-gradient — Reward optimization mirrors classic policy-gradient methods
aie-65-alignment-rlhf-dpo — Production view of the same RLHF and DPO pipeline
ml-05-evaluation
