Natural Language Processing

RLHF and Alignment

GPT-3 without RLHF could state misinformation in a confident tone, assist with harmful requests, and reproduce unwanted patterns from its training data. Between raw GPT-3 and ChatGPT sit thousands of hours of human labeling, a reward model trained on those labels, and PPO fine-tuning against that reward model. That pipeline is alignment.

**ChatGPT** - the first OpenAI product where RLHF became visible to users: the model responds helpfully and declines harmful requests
**Claude** (Anthropic) - trained with Constitutional AI: less human labeling, more model self-critique guided by a set of principles
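**Llama-3-Instruct** - open-weights DPO example: Meta publishes the weights and describes a post-training recipe that includes DPO, but not the preference data itself, so the resulting model can be studied even though the training data cannot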

Prerequisites

What a pretrained LLM is and why instruction tuning is needed
The basic idea of reinforcement learning: policy, reward, optimization
Why next-token prediction does not optimize for helpfulness and harmlessness

Large Language Models

From Preference Learning to InstructGPT

2017. Paul Christiano and co-authors at OpenAI and DeepMind publish "Deep Reinforcement Learning from Human Preferences". The idea: instead of hand-specifying a reward function, show a human pairs of agent behaviors and ask which is better, then train a reward model on those comparisons. The method worked on simulated robotics tasks and Atari games. Five years later, in 2022, Long Ouyang and the OpenAI team scaled the same scheme to instruction-following language models in the InstructGPT paper: outputs of a 1.3B-parameter model trained with RLHF were preferred by human raters over those of the raw 175B GPT-3. Also in 2022, Anthropic introduced Constitutional AI, replacing part of the human labeling with model self-critique against a set of principles. In 2023, Rafailov and co-authors derived DPO, showing that the reward model and PPO can be collapsed into a single supervised loss.

Reward Model: Teaching a Model to Judge Quality

There is no way to directly tell an LLM 'be helpful'. There is no differentiable function that measures helpfulness. A reward model is the workaround: a separate neural network trained to predict which of two responses a human would prefer.

Data collection for the reward model: thousands of pairs (response A, response B) for the same prompt, where human raters choose the better one. The reward model is then trained with a Bradley-Terry objective: the probability that A is preferred over B is modeled as the sigmoid of the difference between their scores. The result is a scalar function reward(prompt, response), where a high value means a good response in humans' judgment.
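
A minimal sketch of the Bradley-Terry objective for a single labeled pair, assuming the reward model has already produced one scalar score per response. The function names are illustrative, not taken from any library, and a real reward model would compute this loss over batches with an autodiff framework.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice under Bradley-Terry.

    Equals -log(sigmoid(reward_chosen - reward_rejected)).
    """
    margin = reward_chosen - reward_rejected
    # log(1 + exp(-margin)) written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Modeled probability that the chosen response beats the rejected one."""
    return math.exp(-bradley_terry_loss(reward_chosen, reward_rejected))
```

When both responses receive the same score, the modeled preference is 0.5 and the loss is log 2. The loss shrinks as the reward model scores the human-preferred response higher than the rejected one.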

The quality of the reward model is the bottleneck of the entire RLHF process. If annotators are unreliable or biased, the reward model amplifies that bias. InstructGPT used ~40 carefully selected annotators. Anthropic experimented with Constitutional AI to reduce dependence on human labeling.

Why is a separate reward model needed rather than directly optimizing the LLM against human preferences?

PPO in RLHF: Policy Optimization with Constraints

The LLM as an RL policy: state = prompt + generated context, action = next token, reward = reward model score after the response is complete. PPO updates the LLM weights to maximize expected reward.

The main problem with naive RL here: without constraints, the model quickly discovers reward hacking - generating responses with a high reward model score that are meaningless to humans. The classic example: repeating a single word that the reward model inexplicably rates highly.
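
A common way to impose the constraint is a KL penalty against the frozen reference model, folded into the per-token reward. The sketch below assumes the log-probabilities of the sampled tokens under both models are already available. The function name and the default `beta=0.1` are illustrative assumptions, not values from the lesson.

```python
from typing import List


def kl_shaped_rewards(
    policy_logprobs: List[float],
    reference_logprobs: List[float],
    reward_model_score: float,
    beta: float = 0.1,  # assumed penalty strength, tuned in practice
) -> List[float]:
    """Per-token rewards for one sampled response.

    Every token pays a penalty proportional to the log-ratio between the
    policy and the reference model (a sample-based estimate of the KL
    divergence). The reward model score is added only on the final token,
    because it is computed once the response is complete.
    """
    if not policy_logprobs or len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("expected two non-empty lists of equal length")
    rewards = [
        -beta * (logp - ref_logp)
        for logp, ref_logp in zip(policy_logprobs, reference_logprobs)
    ]
    rewards[-1] += reward_model_score
    return rewards
```

A response the reference model considers very unlikely, such as one word repeated many times, accumulates a large penalty. A reward-hacking output therefore has to earn enough from the reward model to offset its distance from the original model.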

PPO in RLHF requires keeping 4 models in memory simultaneously: actor (current LLM), reference (frozen original LLM for the KL penalty), reward model, and critic (value function). For a 70B model, the weights plus the optimizer states of the trainable actor and critic can exceed 1 TB of GPU memory. This is one of the main reasons DPO is gaining popularity as a more memory-efficient alternative.
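
A back-of-the-envelope estimate shows where such figures come from. It is a rough sketch under loud assumptions: every model has the same parameter count, weights are stored in 16-bit precision (2 bytes each), trainable models use Adam with a 32-bit master copy and two moment buffers (12 bytes per parameter), and gradients, activations, and generation caches are ignored. The function name is hypothetical.

```python
def training_memory_gb(
    params_billions: float,
    n_models: int,
    n_trainable: int,
    bytes_per_weight: int = 2,           # assumed 16-bit weights
    optimizer_bytes_per_param: int = 12  # assumed Adam: fp32 copy + 2 moments
) -> float:
    """Rough lower bound on memory for weights and optimizer state, in GB."""
    params = params_billions * 1e9
    weights = n_models * params * bytes_per_weight
    optimizer = n_trainable * params * optimizer_bytes_per_param
    return (weights + optimizer) / 1e9


# PPO: actor, reference, reward model, critic; actor and critic are trained.
ppo_gb = training_memory_gb(70, n_models=4, n_trainable=2)
# DPO: policy and reference; only the policy is trained.
dpo_gb = training_memory_gb(70, n_models=2, n_trainable=1)
```

Under these assumptions the PPO setup needs about 2240 GB and the DPO setup about 1120 GB before any activations are counted. Real systems reduce this with sharding, smaller reward and critic models, or parameter-efficient fine-tuning.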

Why is a KL-divergence penalty added in RLHF?

DPO: Alignment Without a Reward Model

Direct Preference Optimization (Rafailov et al., 2023) is an algebraic trick that removes the separate reward model and the PPO loop from the pipeline. The observation: the optimal policy under the KL-regularized RLHF objective has a closed form, so the reward can be written in terms of the policy itself and the LLM can be trained directly on preference data.
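
The algebra can be sketched in a few lines. The notation is an assumption introduced here, following the usual presentation of DPO: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $r$ the reward, $\beta$ the KL penalty strength, $Z(x)$ a normalizing constant, and $y_w$, $y_l$ the chosen and rejected responses.

```latex
% KL-regularized RLHF objective
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% Closed-form optimal policy
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Solving for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Substituting into the Bradley-Terry model: the \beta \log Z(x) terms cancel
P(y_w \succ y_l \mid x) = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big)

% DPO loss: maximize this likelihood with \pi_\theta in place of \pi^{*}
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[ \log \sigma\!\Big(
  \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
  - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \Big) \Big]
```

The intractable $Z(x)$ appears in both rewards and drops out of their difference, which is why no explicit reward model or sampling from the policy is needed.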

DPO reformulates the problem as a supervised loss. Input: (chosen, rejected) pairs for each prompt. The loss increases the probability of chosen and decreases the probability of rejected relative to the reference model. No RL, no value functions, no sampling during training.
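
A minimal sketch of that loss for one (chosen, rejected) pair, assuming each argument is the summed log-probability of the whole response under the corresponding model. The function name and the default `beta=0.1` are illustrative assumptions. A real implementation would compute this over batches of tensors and backpropagate through the policy log-probabilities only.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed strength of the implicit KL constraint
) -> float:
    """DPO loss for a single preference pair.

    Computes -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))),
    where each log-ratio compares the policy with the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # log(1 + exp(-margin)) written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

At the start of training the policy equals the reference model, both log-ratios are zero, and the loss is log 2. It falls as the policy assigns relatively more probability to the chosen response and relatively less to the rejected one.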

DPO in practice: Llama-3-Instruct, Zephyr (a Mistral-7B fine-tune), Phi-3 - many modern open-weight models use DPO in post-training, either instead of PPO or alongside it. It keeps two models in memory instead of four (the policy and the frozen reference) and tends to train more stably. The caveat: DPO is more sensitive to the quality of preference data.

What is the key advantage of DPO over RLHF with PPO?

Constitutional AI: Alignment Through Principles

Constitutional AI (Anthropic, 2022) is an alternative approach: instead of preference pairs from humans, the model is trained to follow a set of principles (a 'constitution'). The LLM itself generates critique of its own responses and revisions based on the constitution.

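The CAI cycle: the model generates a response, critiques it according to the principles, rewrites it, repeats. Anthropic's constitution includes principles such as 'do not help with harmful actions' and 'be honest'. The revised responses are used for supervised fine-tuning. In a second stage, a model compares pairs of sampled responses against the principles, and these AI-generated preference labels train the preference model used for RL (RL from AI feedback, RLAIF).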
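
The critique-and-revise cycle can be sketched as a short loop. This is an illustration under stated assumptions, not Anthropic's implementation: `generate` is a hypothetical stand-in for any function that maps a text prompt to a model completion, the prompt templates are invented for the example, and applying every principle in turn is a simplification.

```python
from typing import Callable, List


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],  # hypothetical LLM call: text in, text out
    principles: List[str],
    rounds: int = 1,
) -> str:
    """Return a response revised against each principle, `rounds` times over."""
    response = generate(prompt)
    for _ in range(rounds):
        for principle in principles:
            # Step 1: ask the model to critique its own response.
            critique = generate(
                f"Prompt: {prompt}\nResponse: {response}\n"
                f"Critique the response according to this principle: {principle}"
            )
            # Step 2: ask the model to rewrite the response using the critique.
            response = generate(
                f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
                "Rewrite the response to address the critique."
            )
    return response
```

Each principle costs two model calls per round, one for the critique and one for the revision, on top of the initial generation. No human rater is involved inside the loop.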

CAI reduces dependence on human raters: most of the labeling is done by a strong LLM itself. This scales better than collecting human labels. Claude (Anthropic) is trained using CAI. The downside: the 'constitution' itself requires careful human design - the problem shifts rather than disappears.

Misconception: RLHF and DPO fully solve the alignment problem

Reality: RLHF and DPO reduce harmful outputs but do not eliminate alignment issues - models can still give confidently wrong answers

Alignment is multidimensional: harmlessness, honesty, and helpfulness can conflict. Current methods optimize proxy metrics (human approval), not the true objectives, so Goodhart's law applies: the harder a proxy is optimized, the less reliably it tracks the goal it stands for

What is the 'constitution' in Constitutional AI?

Key Ideas

**Reward model** is trained to predict human preferences between responses, turning subjective feedback into a scalar signal that can be optimized
**PPO** updates the LLM to maximize reward with a KL penalty against the original model - without the penalty, the policy quickly drifts into reward hacking
**DPO** eliminates the reward model and RL, directly training the LLM on preference pairs - simpler, more stable, less memory
**Constitutional AI** delegates critique to the model itself via a set of principles, reducing dependence on human annotators

Questions for Reflection

Can a reward model become a better judge of text quality than humans? When is this useful, when is it dangerous?
Why is DPO more sensitive to preference data quality than PPO with a reward model?
If a model is trained to be helpful and harmless simultaneously, what happens when those goals conflict?

Related Lessons

nlp-15 — LLMs and scaling laws - the foundation for understanding why alignment is needed
rl-05 — PPO from RL is used directly in the RLHF pipeline
rl-10 — PPO in detail - the optimization algorithm in RLHF
nlp-17 — RAG is built on aligned models, not raw pretrained ones
rl-17 — RLHF for AI alignment - a dedicated lesson with full details
stat-05-hypothesis

