AI Engineering
Alignment: How Models Become Helpful - RLHF, DPO, Constitutional AI
Lesson Objectives
- Understand the deep gap between a pretrained and an aligned model
- Break down RLHF (InstructGPT 2022): SFT, Reward Model, PPO - the purpose of each step
- Understand DPO: why it replaced RLHF in most open-source models
- Learn Constitutional AI and RLAIF - how Anthropic solved alignment at scale
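November 2022: ChatGPT launches. Under the hood is a GPT-3.5 model, but not a raw one: it was aligned with the approach introduced in InstructGPT. In the InstructGPT evaluation, labelers preferred outputs from a 1.3B-parameter aligned model over those of the 175B GPT-3, and preferred the 175B InstructGPT over GPT-3 about 85% of the time (Ouyang et al., 2022). Not because it was smarter, but because it was aligned. In the roughly two and a half years between GPT-3 and ChatGPT, the same thing happened that separates a hammer from a surgical instrument: raw size matters less than precise application. Alignment is that precision.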
- InstructGPT (Ouyang et al., 2022) - the landmark demonstration of RLHF for instruction following: labelers preferred a 1.3B model's outputs over those of 175B GPT-3
- DPO (Rafailov et al., 2023) - now a standard choice in open-source: Llama 3, Mistral, and many Hugging Face models use DPO or its variants, sometimes alongside other methods
- Constitutional AI (Anthropic, 2022) - underpins Claude: model behavior is defined by explicit principles, not just human feedback
- RLAIF scales alignment: instead of 1,000 human annotators, an AI judge. AI labels cost far less than human labels, with comparable quality reported on some tasks
Prerequisites
- How an LLM works: pretraining, next-token prediction, autoregressive generation
- Working knowledge of fine-tuning: training a model further on extra data
- Familiarity with the concept of a loss function and gradient descent
From RL on Human Preferences to DPO: How LLM Personality Was Built
Alignment traces back to 2017: Christiano et al. (OpenAI and DeepMind) in 'Deep Reinforcement Learning from Human Preferences' showed that an agent could be trained on pairwise human comparisons instead of an explicit reward function, by fitting a reward model to those comparisons. Five years later that idea became the backbone of ChatGPT. In 2022 Ouyang et al. (OpenAI) published InstructGPT: three-step RLHF (SFT, reward model, PPO), after which outputs of a 1.3B-parameter model were preferred by humans over those of 175B GPT-3. That same year Anthropic introduced Constitutional AI and RLAIF: instead of relying on thousands of human ratings, the model critiques and rewrites its own answers against a set of principles (a constitution), with an AI model acting as judge.
In May 2023 Rafailov et al. (Stanford) released DPO (Direct Preference Optimization, arXiv 2305.18290, later NeurIPS 2023). They showed mathematically that the reward in RLHF can be expressed in closed form through the policy itself, which removes the separate reward model and PPO and replaces the pipeline with a simple classification loss. DPO proved simpler and more stable than PPO, and by 2024-2025 it had become a standard choice for open-source models such as Llama 3 and Mistral.
Pretraining vs Alignment: Two Completely Different Processes
GPT-3 in 2020 knew one thing: predict the next token. Ask it to write a letter and it continued in the style of the request instead of writing the letter. Ask it to solve a problem and it listed similar problems from the internet. This was not the model being stupid: it was precisely executing its training objective. **Pretraining optimizes for statistical closeness to the training corpus, not for usefulness.**
Alignment is a separate process on top of pretraining. Its goal: redirect the model from 'predict text' to 'be a helpful assistant'. This is a fundamental shift in the objective function, not just additional training.
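To make the pretraining objective concrete, here is a minimal sketch of next-token loss as average negative log-likelihood. The probability values are made up for illustration, and `next_token_loss` is a hypothetical helper rather than any library's API.

```python
import math

def next_token_loss(probs_of_actual_tokens):
    """Average negative log-likelihood of the tokens that actually came next."""
    total = sum(math.log(p) for p in probs_of_actual_tokens)
    return -total / len(probs_of_actual_tokens)

# Illustrative (made-up) probabilities a model assigned to the true next tokens.
well_predicted = [0.9, 0.8, 0.85]    # text that closely matches the corpus
poorly_predicted = [0.2, 0.1, 0.3]   # text the corpus makes unlikely

print(next_token_loss(well_predicted))
print(next_token_loss(poorly_predicted))
```

Nothing in this loss measures whether a continuation is helpful. It only measures how well the model matches the text it was trained on, which is the gap alignment has to close.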
The gap between GPT-3 and InstructGPT, despite InstructGPT being **smaller** in parameters, became the best-known evidence that for assistant behavior alignment can matter more than size. Labelers preferred outputs from the 1.3B-parameter InstructGPT over those of the 175B GPT-3, and the 175B InstructGPT was preferred over GPT-3 about 85% of the time (Ouyang et al., 2022).
**Terminology:** 'alignment' literally means aligning model goals with human goals. An unaligned model is a powerful tool with no direction. An aligned one is the same tool trained to be helpful in a specific way.
Why did GPT-3 fail to follow instructions despite being a powerful model?
RLHF: InstructGPT and Teaching Models to Follow Instructions
Ouyang et al. (OpenAI, 2022) described the three-step process that turned GPT-3 into InstructGPT and later underpinned ChatGPT. It is called **RLHF** (Reinforcement Learning from Human Feedback). Each step solves a specific problem left by the previous one.
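The key RLHF insight: **the reward model is a compression of human preferences**. Labelers cannot score every response the policy produces during RL training, but they can rank a fixed set of tens of thousands of response pairs. The RM generalizes these rankings into a learned scoring function that returns a scalar reward for any new response, which PPO can then optimize against. This is why roughly 13K demonstrations and 33K comparisons were enough to align a 175B model.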
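The way a reward model learns from comparisons can be sketched with the pairwise (Bradley-Terry style) loss that Ouyang et al. describe: the loss is small when the chosen response scores higher than the rejected one. The scores below are illustrative, and `pairwise_rm_loss` is a hypothetical helper that operates on already-computed scalar rewards.

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin) = max(-margin, 0) + log(1 + exp(-|margin|))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The RM ranks the pair correctly: low loss.
print(pairwise_rm_loss(2.0, 0.0))
# The RM ranks the pair the wrong way round: high loss.
print(pairwise_rm_loss(0.0, 2.0))
```

In a real pipeline the two rewards would come from a neural network scoring (prompt, response) pairs, and this loss would be averaged over a batch and minimized by gradient descent.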
| Step | Data | Goal | Output |
|---|---|---|---|
| SFT | ~13K demonstrations | Assistant response format | SFT model |
| RM Training | ~33K comparisons | Predict human preference | Reward Model |
| PPO/RL | RM feedback | Maximize reward | InstructGPT |
**The RLHF problem:** it is expensive and unstable. It requires tens of thousands of human comparisons, PPO is a complex algorithm that is sensitive to hyperparameters, and the KL penalty that keeps the policy close to the SFT model needs careful tuning. This is exactly what motivated the search for alternatives that led to DPO.
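The role of the KL term can be sketched as follows: during the PPO step, the reward the policy sees is the reward model's score minus a penalty for drifting away from the SFT model. This is a simplified per-sample sketch. `kl_shaped_reward` is a hypothetical helper, and the `beta` value is illustrative, since in practice it is tuned.

```python
def kl_shaped_reward(rm_score, logp_policy, logp_sft, beta=0.02):
    """Reward model score minus a penalty for drifting from the SFT model.

    logp_policy and logp_sft are the log-probabilities of the same sampled
    response under the current policy and under the frozen SFT model.
    Their difference is a single-sample estimate of the KL divergence.
    """
    kl_estimate = logp_policy - logp_sft
    return rm_score - beta * kl_estimate

# No drift: the policy gets the raw reward model score.
print(kl_shaped_reward(1.0, logp_policy=-5.0, logp_sft=-5.0))
# The policy has become much more confident than the SFT model: reward is reduced.
print(kl_shaped_reward(1.0, logp_policy=-2.0, logp_sft=-5.0))
```

If `beta` is too small, the policy can exploit flaws in the reward model. If it is too large, the policy barely moves from the SFT model. That trade-off is one source of the tuning burden.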
Why does RLHF need a separate Reward Model instead of direct human feedback?
DPO: RLHF Without a Reward Model, and Much Simpler
Rafailov et al. (Stanford, 2023) showed that the explicit reward model in RLHF is an intermediate artifact that can be eliminated. **DPO (Direct Preference Optimization)** solves the same problem with a binary cross-entropy loss applied directly to (chosen, rejected) response pairs. No RL. No PPO. No separate reward model: the only extra component is a frozen copy of the starting model, used as a reference.
Mathematical intuition: RLHF frames the problem as RL against a reward model with a KL penalty. DPO shows that the optimal policy for this problem has a closed form, which lets the reward be written as a scaled log-ratio of policy to reference-model probabilities. Substituting that expression into the preference loss removes the reward model from the loop.
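The resulting loss is short enough to sketch for a single preference pair. It assumes the summed log-probabilities of each response have already been computed under the trainable policy and under a frozen reference model (typically the SFT checkpoint). `dpo_loss` is a hypothetical helper, and `beta=0.1` is an illustrative setting.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities."""
    # Implicit rewards: how much more likely each response is under the
    # policy than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: loss is log(2), no preference learned yet.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen response: loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

This has the same shape as a reward model's pairwise loss, except that the "reward" is read off the policy's own log-probabilities, so no second model has to be trained.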
| Property | RLHF | DPO |
|---|---|---|
| Reward model | Required (separate model) | Not needed |
| RL algorithm | PPO (complex) | None (cross-entropy) |
| Training stability | Unstable | Stable |
| Data pipeline | Comparisons -> RM -> PPO | (chosen, rejected) pairs directly |
| Implementation complexity | High | Low |
| Quality (in practice) | Comparable | Comparable |
By 2024-2025 DPO had become a default choice in open-source. Llama 3 (in combination with other methods), Mistral, and many other open-source models use DPO or its variants (IPO, KTO, ORPO). RLHF with PPO remained mostly in labs with the resources to manage its instability: OpenAI and, in early versions, Anthropic.
**Practically:** DPO requires a preference dataset - (prompt, chosen_response, rejected_response) triples. Public datasets exist: Anthropic HH-RLHF, OpenAssistant, UltraFeedback. Hugging Face TRL library implements DPOTrainer out of the box.
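A preference record can be sketched as a plain dictionary. The field names below follow the common prompt/chosen/rejected convention, but the exact schema depends on the trainer and dataset, so treat them as an assumption and check the documentation of the tool you use. `validate_preference_record` is a hypothetical helper, not part of TRL.

```python
def validate_preference_record(record):
    """Check that a record is a usable (prompt, chosen, rejected) triple."""
    required = ("prompt", "chosen", "rejected")
    missing = [key for key in required
               if not isinstance(record.get(key), str) or not record[key].strip()]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    if record["chosen"] == record["rejected"]:
        raise ValueError("chosen and rejected responses must differ")
    return record

example = {
    "prompt": "Explain what a reward model does in one sentence.",
    "chosen": "It scores responses so that better ones get higher rewards.",
    "rejected": "Reward models are a thing in machine learning.",
}

print(validate_preference_record(example)["prompt"])
```

Filtering out empty or identical pairs before training is cheap and avoids feeding the loss examples that carry no preference signal.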
What is the main engineering advantage of DPO over RLHF when adding alignment to a model?
Constitutional AI and RLAIF: Models That Align Themselves
Anthropic (2022) published Constitutional AI: an approach where, instead of relying on human feedback for harmlessness, the model uses a **constitution** (a set of written principles) to critique and revise its own responses. RLAIF (RL from AI Feedback) is the generalization: use an AI model as the judge instead of humans.
Why this matters for engineers: Constitutional AI makes **evaluation criteria explicit and auditable**. Standard RLHF is closer to a black box: it is unclear exactly which criteria human labelers apply. A constitution is like code: readable, modifiable, version-controlled, and auditable for regulatory compliance.
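The critique-and-revise idea can be sketched as a loop over principles. Everything here is a simplified illustration: the principles are made-up examples and not Anthropic's actual constitution, the prompt wording is invented, and `stub_model` is a stand-in for a real LLM call.

```python
from typing import Callable, List

# Illustrative principles only; stored as data so they can be reviewed and versioned.
CONSTITUTION = [
    "Prefer the response that avoids helping with harmful activities.",
    "Prefer the response that is honest about uncertainty.",
]

def critique_and_revise(prompt: str, model: Callable[[str], str],
                        principles: List[str]) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"Critique the response using this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    return response

def stub_model(text: str) -> str:
    """Hypothetical stand-in for an LLM, so the sketch runs without a model."""
    if text.startswith("Critique"):
        return "The response could be more careful."
    if text.startswith("Rewrite"):
        return "Revised response."
    return "Initial draft response."

print(critique_and_revise("How do I secure my home network?", stub_model, CONSTITUTION))
```

In the published method, the revised responses become supervised fine-tuning data. Because the principles live in a plain list, changing the model's target behavior is a reviewable diff rather than a new round of labeling.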
| Method | Feedback source | Scale | Transparency |
|---|---|---|---|
| RLHF | Human annotators | Expensive, slow | Low (implicit preferences) |
| DPO | Human comparisons | Cheaper than RM+PPO | Medium (pairs visible) |
| Constitutional AI | Constitution + AI critique | Scales without humans | High (explicit principles) |
| RLAIF | AI model as judge | Cheap and fast | Depends on judge model |
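The RLAIF row can be sketched as a labeling function: an AI judge picks the better of two responses, and the result is stored as a preference record that a reward model or DPO can train on. The judge here is a toy stand-in (it simply prefers the longer answer), and all names are hypothetical.

```python
from typing import Callable, Dict

def label_pair_with_judge(prompt: str, response_a: str, response_b: str,
                          judge: Callable[[str, str, str], str]) -> Dict[str, str]:
    """Turn a judge's verdict ("A" or "B") into a preference record."""
    verdict = judge(prompt, response_a, response_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected judge verdict: {verdict!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Toy stand-in for an LLM judge: prefers the longer response."""
    return "A" if len(response_a) >= len(response_b) else "B"

pair = label_pair_with_judge(
    "What is DPO?",
    "No idea.",
    "Direct Preference Optimization, a method for training on preference pairs.",
    toy_judge,
)
print(pair["chosen"])
```

A real judge would be an LLM prompted with evaluation criteria, which is why the table lists transparency as dependent on the judge model: the quality of the labels is bounded by the quality and biases of the judge.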
**Claude as an example:** Anthropic states that its Claude models are trained using Constitutional AI. This means Claude's behavior is partly determined by an explicit set of principles, as opposed to models where alignment is encoded entirely in human preferences.
What is the main engineering advantage of Constitutional AI over standard RLHF?
Common Misconceptions
**Myth:** Alignment is just additional training on good data.
**Reality:** Alignment changes the objective function: from token prediction to maximizing human preferences. This is not cosmetic; it is a different optimization problem.
A pretrained GPT already has most of the knowledge it needs. The problem is not knowledge: the model simply does not optimize for 'be helpful'. RLHF and DPO change the training objective itself through a preference signal.
**Myth:** A bigger model is a better-aligned model.
**Reality:** Outputs of InstructGPT 1.3B were preferred over those of GPT-3 175B on instruction following. For assistant tasks, alignment can be more effective than scaling.
Model size determines capability (knowledge, reasoning). Alignment determines behavior (instruction following, harmlessness). These are largely independent axes. Claude 3 Haiku is better aligned than many large unaligned open-source models.
Key Takeaways
- Pretraining = token prediction. Alignment = learning to be helpful. Different problems with different data and methods
- RLHF (2022): three steps - SFT -> Reward Model -> PPO. The reward model compresses human preferences into a differentiable function
- DPO (2023): eliminates the separate reward model and PPO, optimizes directly on (chosen, rejected) pairs. A standard for open-source models since 2024
- Constitutional AI: principles -> AI-critique -> self-revision. Alignment criteria become explicit, auditable, version-controlled
- RLAIF: AI model as judge instead of humans. Scales orders of magnitude cheaper than human labeling
Questions for Reflection
- If you had to add alignment to a custom model for a specific product, which method would you choose and why: RLHF, DPO, or Constitutional AI?
- How is a 'constitution' as a set of principles better or worse than implicit human preferences from RLHF for a production system?
- How does alignment affect what a model refuses to do? Is this a bug or a feature - and who should control these boundaries?
Related Topics
Alignment is the foundation for fine-tuning and advanced reasoning. Understanding these methods helps in selecting and configuring models correctly.
- Fine-tuning: LoRA, QLoRA, PEFT - DPO is implemented on top of fine-tuning infrastructure
- Reasoning Models - RLVR (RL with Verifiable Rewards) applies the same RL machinery to reasoning, replacing the learned reward model with automatically checkable rewards
- Guardrails and Safety - Alignment sets the baseline safety; guardrails add application-level control
Related Lessons
- aie-03-llm-fundamentals — Alignment starts from pretrained LLM internals
- aie-36-fine-tuning — RLHF and DPO are alignment-stage fine-tuning
- aie-33-guardrails — Constitutional AI encodes guardrails into training
- aie-53-future-reasoning — RL on traces trains reasoning models
- ml-50-policy-gradient — RLHF optimizes a policy via reward gradients