Natural Language Processing
RLHF and Alignment
GPT-3 without RLHF could generate misinformation in a confident tone, assist with harmful requests, and continue unwanted patterns from training data. Between a raw pretrained model like GPT-3 and ChatGPT sit large-scale human labeling, a reward model, and PPO training. That process is alignment.
- **ChatGPT** - the OpenAI product that made RLHF visible to a mass audience: the model responds helpfully and declines harmful requests
- **Claude** (Anthropic) - trained with Constitutional AI: less human labeling, more model self-critique guided by a set of principles
- **Llama-3-Instruct** - open-weight example that uses DPO: Meta publishes the weights and describes a post-training recipe that combines DPO with other methods, though not the preference data, which makes the process easier to study
Prerequisites
- What a pretrained LLM is and why instruction tuning is needed
- The basic idea of reinforcement learning: policy, reward, optimization
- Why next-token prediction does not optimize for helpfulness and harmlessness
From Preference Learning to InstructGPT
2017. Paul Christiano and co-authors at OpenAI and DeepMind publish "Deep Reinforcement Learning from Human Preferences". The idea: instead of hand-specifying a reward function, show a human pairs of agent behaviors and ask which is better, then train a reward model on those comparisons. The method worked on simulations and Atari games. Five years later, in 2022, Long Ouyang and the OpenAI team applied the same scheme to language models in the InstructGPT paper, and a 1.3B-parameter model trained with RLHF was preferred by humans over the raw 175B GPT-3. That same year, 2022, Anthropic introduced Constitutional AI, replacing part of the human labeling with model self-critique against a set of principles. In 2023, Rafailov and co-authors derived DPO, showing the reward model and PPO can be collapsed into a single supervised loss.
Reward Model: Teaching a Model to Judge Quality
There is no way to directly tell an LLM 'be helpful'. There is no differentiable function that measures helpfulness. A reward model is the workaround: a separate neural network trained to predict which of two responses a human would prefer.
Data collection for the reward model: thousands of pairs (response A, response B) where human raters choose the better one. A Bradley-Terry model then learns to predict these preferences. The result is a scalar function reward(prompt, response), where a high value means a good response in humans' judgment.
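The Bradley-Terry step above can be sketched in a few lines. This is a minimal illustration, not a training loop: the function names and the example scores are hypothetical, and a real reward model would produce the two scalars from a neural network rather than receive them as arguments.

```python
import math

def bt_preference_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the first response is preferred:
    a sigmoid of the difference between the two scalar rewards."""
    gap = reward_chosen - reward_rejected
    return 1.0 / (1.0 + math.exp(-gap))

def bt_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's choice for one labeled pair.
    log1p(exp(-gap)) equals -log(sigmoid(gap))."""
    gap = reward_chosen - reward_rejected
    return math.log1p(math.exp(-gap))

# Hypothetical reward-model scores for two responses to the same prompt.
print(bt_preference_prob(2.0, 0.5))  # above 0.5: the model agrees with the rater
print(bt_loss(2.0, 0.5))             # small loss when the ranking is right
print(bt_loss(0.5, 2.0))             # larger loss when the ranking is wrong
```

Minimizing this loss over many labeled pairs pushes the reward of chosen responses above the reward of rejected ones; only the difference between the two scores matters, not their absolute values.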
The quality of the reward model is the bottleneck of the entire RLHF process. If annotators are unreliable or biased, the reward model amplifies that bias. InstructGPT used ~40 carefully selected annotators. Anthropic experimented with Constitutional AI to reduce dependence on human labeling.
Why is a separate reward model needed rather than directly optimizing the LLM against human preferences?
PPO in RLHF: Policy Optimization with Constraints
The LLM as an RL policy: state = prompt + generated context, action = next token, reward = reward model score after the response is complete. PPO updates the LLM weights to maximize expected reward.
The main problem with naive RL here: without constraints, the model quickly discovers reward hacking, that is, generating responses that score highly under the reward model but are meaningless to humans. A simple illustration: repeating a word or phrase that the reward model happens to rate highly.
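A common constraint is a KL penalty against the frozen reference model: each token is penalized in proportion to how much more likely the policy makes it than the reference does. The sketch below shows one way to shape per-token rewards under that assumption; the function name, the `beta` value, and the example log-probabilities are illustrative rather than taken from any specific implementation.

```python
from typing import List

def kl_shaped_rewards(
    policy_logprobs: List[float],
    ref_logprobs: List[float],
    rm_score: float,
    beta: float = 0.1,  # assumed penalty strength, tuned in practice
) -> List[float]:
    """Per-token rewards for one sampled response.

    Every token gets -beta * (log pi(token) - log pi_ref(token));
    the reward-model score is added on the final token only.
    """
    if not policy_logprobs or len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("need two non-empty log-prob lists of equal length")
    rewards = [-beta * (lp - lr) for lp, lr in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards

# A policy that drifted far from the reference pays for it:
drifted = kl_shaped_rewards([-0.1, -0.1, -0.1], [-2.0, -2.0, -2.0], rm_score=1.0)
# A policy identical to the reference keeps the full reward-model score:
same = kl_shaped_rewards([-1.0, -1.0, -1.0], [-1.0, -1.0, -1.0], rm_score=1.0)
print(sum(drifted), sum(same))
```

With the penalty in place, a degenerate response only pays off if its reward-model score outweighs the cost of moving away from the reference distribution.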
PPO in RLHF typically keeps 4 models in memory simultaneously: actor (current LLM), reference (frozen original LLM for KL), reward model, critic (value function). If all four are 70B-scale, the 16-bit weights alone take about 560 GB, and optimizer state for the two trained models (actor and critic) pushes the total past 1 TB of VRAM. This memory cost is one reason DPO is gaining popularity as a more memory-efficient alternative.
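The memory claim can be checked with back-of-the-envelope arithmetic. The sketch below assumes all models are the same size, weights are stored in 16 bits, and each trained model carries two 32-bit Adam moments per parameter; gradients, activations, and generation caches are ignored, so real usage is higher. The function name and defaults are hypothetical.

```python
def training_memory_gb(
    params_billion: float,
    n_models: int,
    n_trainable: int,
    bytes_per_weight: int = 2,          # assumption: 16-bit weights
    optimizer_bytes_per_param: int = 8, # assumption: two fp32 Adam moments
) -> float:
    """Rough lower bound in GB (1 GB = 1e9 bytes): weights for every model
    plus optimizer state for the models that are actually trained."""
    weights = n_models * params_billion * bytes_per_weight
    optimizer = n_trainable * params_billion * optimizer_bytes_per_param
    return weights + optimizer

# PPO: actor, reference, reward model, critic; actor and critic are trained.
print(training_memory_gb(70, n_models=4, n_trainable=2))  # 1680
# DPO: policy and reference; only the policy is trained.
print(training_memory_gb(70, n_models=2, n_trainable=1))  # 840
```

In practice the reward model and critic are often smaller than the actor, so these figures are an upper-end illustration of the all-70B case rather than a measurement.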
Why is a KL-divergence penalty added in RLHF?
DPO: Alignment Without a Reward Model
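Direct Preference Optimization (Rafailov et al., 2023) is an algebraic trick that eliminates the separate reward model and PPO from the pipeline entirely. The observation: the optimal policy under the KL-constrained RLHF objective has a closed form in terms of the reward. Inverting that relationship expresses the reward through the policy itself, so substituting it into the Bradley-Terry preference loss lets one train the LLM directly on preference data.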
DPO reformulates the problem as a supervised loss. Input: (chosen, rejected) pairs for each prompt. The loss increases the probability of chosen and decreases the probability of rejected relative to the reference model. No RL, no value functions, no sampling during training.
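For a single preference pair, the loss described above can be written out as follows. This is a sketch of the per-pair loss only: the function name, `beta` value, and example numbers are illustrative, and the four inputs stand for summed token log-probabilities of each full response under the trained policy and the frozen reference.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (chosen, rejected) pair.

    The log-ratio against the reference acts as an implicit reward;
    the loss is -log(sigmoid(beta * (chosen_ratio - rejected_ratio))).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return math.log1p(math.exp(-margin))

# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Raising chosen and lowering rejected relative to the reference reduces it.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

Because the reference log-probabilities do not depend on the trained weights, they can be computed once per pair, and no responses need to be sampled during training.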
DPO in practice: many modern open-weight models report using DPO in post-training, for example Llama-3-Instruct (alongside other methods) and Phi-3. It requires roughly half the memory (2 models instead of 4) and tends to train more stably than PPO. The caveat: DPO is more sensitive to the quality of preference data.
What is the key advantage of DPO over RLHF with PPO?
Constitutional AI: Alignment Through Principles
Constitutional AI (Anthropic, 2022) is an alternative approach: instead of relying on human preference labels for harmlessness, the model is trained to follow a set of principles (a 'constitution'). The LLM itself generates critiques of its own responses and revisions based on the constitution.
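The CAI cycle has two phases. In the supervised phase, the model generates a response, critiques it according to the principles, and rewrites it, repeating as needed; the final revisions are used for supervised fine-tuning. In the RL phase, a model compares pairs of responses against the principles, and these AI-generated preferences train the preference model used for RL (often called RLAIF). Anthropic's constitution includes principles along the lines of 'do not help with harmful actions' and 'be honest'.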
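The critique-and-rewrite cycle described above can be sketched as a loop. Everything here is illustrative: `generate` is a hypothetical stand-in for an LLM call, the prompt wording is invented, and iterating over every principle in order is a simplification. `fake_llm` is a stub so the example runs without a model.

```python
from typing import Callable, List

def critique_and_revise(
    prompt: str,
    principles: List[str],
    generate: Callable[[str], str],  # hypothetical LLM call: text in, text out
    rounds: int = 1,
) -> str:
    """Draft a response, then repeatedly critique and rewrite it
    against each principle. Returns the final revision."""
    response = generate(prompt)
    for _ in range(rounds):
        for principle in principles:
            critique = generate(
                f"Critique this response using the principle '{principle}':\n{response}"
            )
            response = generate(
                f"Rewrite the response to address the critique.\n"
                f"Critique: {critique}\nResponse: {response}"
            )
    return response

def fake_llm(text: str) -> str:
    """Stub model for demonstration only; returns canned strings."""
    if text.startswith("Critique"):
        return "the draft could be more careful"
    if text.startswith("Rewrite"):
        return "revised answer"
    return "draft answer"

print(critique_and_revise("How do I stay safe online?", ["be honest"], fake_llm))
```

The point of the structure is that the same model plays three roles (author, critic, editor), and the principles enter only through the text of the critique prompt.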
CAI reduces dependence on human raters: much of the labeling is done by a strong LLM itself. This scales better than RLHF with purely human labels. Claude (Anthropic) is trained using CAI. The downside: the 'constitution' itself requires careful human design, so the problem shifts rather than disappears.
Misconception: RLHF and DPO fully solve the alignment problem.
Reality: RLHF and DPO reduce harmful outputs but do not eliminate alignment issues; models can still give confidently wrong answers.
Alignment is multidimensional: harmlessness, honesty, and helpfulness can conflict. Current methods optimize proxy metrics (human approval), not true objectives, so Goodhart's Law applies: the harder a proxy is optimized, the more it can diverge from what was actually wanted.
What is the 'constitution' in Constitutional AI?
Related Topics
RLHF/DPO connect NLP and reinforcement learning through shared algorithms:
- Large Language Models — LLMs are the models being aligned by the techniques analyzed here
- PPO: Proximal Policy Optimization — Optimization algorithm used in RLHF
- RLHF: RL for AI Alignment — Detailed analysis of RLHF from the RL perspective
Key Ideas
- **Reward model** is trained to predict human preferences between responses, turning subjective feedback into a scalar reward signal that can be optimized
- **PPO** updates the LLM to maximize reward with a KL penalty against the original model - without the penalty, the policy tends to drift into reward hacking
- **DPO** eliminates the reward model and RL, directly training the LLM on preference pairs - simpler, more stable, less memory
- **Constitutional AI** delegates critique to the model itself via a set of principles, reducing dependence on human annotators
Questions for Reflection
- Can a reward model become a better judge of text quality than humans? When is this useful, when is it dangerous?
- Why is DPO more sensitive to preference data quality than PPO with a reward model?
- If a model is trained to be helpful and harmless simultaneously, what happens when those goals conflict?
Related Lessons
- nlp-15 — LLMs and scaling laws - the foundation for understanding why alignment is needed
- rl-05 — PPO from RL is used directly in the RLHF pipeline
- rl-10 — PPO in detail - the optimization algorithm in RLHF
- nlp-17 — RAG is built on aligned models, not raw pretrained ones
- rl-17 — RLHF for AI alignment - a dedicated lesson with full details
- stat-05-hypothesis