AI Engineering

Alignment: How Models Become Helpful - RLHF, DPO, Constitutional AI

Lesson Objectives

  • Understand the deep gap between a pretrained and an aligned model
  • Break down RLHF (InstructGPT 2022): SFT, Reward Model, PPO - the purpose of each step
  • Understand DPO: why it replaced PPO-based RLHF in many open-source models
  • Learn Constitutional AI and RLAIF - how Anthropic approached alignment at scale

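November 2022: ChatGPT launches. Under the hood is GPT-3.5, but not a raw pretrained model: it was fine-tuned with the RLHF recipe introduced for InstructGPT. In the InstructGPT evaluations, labelers preferred outputs from the 1.3B-parameter model over outputs from 175B GPT-3, and preferred 175B InstructGPT over 175B GPT-3 about 85% of the time (Ouyang et al., 2022). Not because it was smarter, but because it was aligned. In the roughly two and a half years between GPT-3 and ChatGPT, the same thing happened that separates a hammer from a surgical instrument: what matters is precise application, not size alone. Alignment is that precision.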

  • InstructGPT (Ouyang et al., 2022) - a landmark public demonstration of RLHF for instruction following: outputs from a 1.3B model were preferred over those from 175B GPT-3
  • DPO (Rafailov et al., 2023) - now a standard choice in open-source post-training: Llama 3 used it in its pipeline, and many Hugging Face models, including fine-tunes of Mistral, use DPO or its variants
  • Constitutional AI (Anthropic, 2022) - underpins Claude: model behavior is defined by explicit principles, not just human feedback
  • RLAIF scales alignment: an AI judge takes over most of the labeling that would otherwise need a large pool of human annotators, at a fraction of the cost per label

Prerequisites

  • How an LLM works: pretraining, next-token prediction, autoregressive generation
  • Working knowledge of fine-tuning: training a model further on extra data
  • Familiarity with the concept of a loss function and gradient descent
  • How LLMs Work: From Tokens to Generation
  • Fine-tuning: LoRA, QLoRA, PEFT

From RL on Human Preferences to DPO: How LLM Personality Was Built

Alignment traces back to 2017: Christiano et al. (OpenAI and DeepMind), in 'Deep Reinforcement Learning from Human Preferences', showed that an agent could be trained on pairwise human comparisons instead of an explicit reward function, by fitting a reward model to those comparisons. Five years later that idea became the backbone of ChatGPT.

In 2022 Ouyang et al. (OpenAI) published InstructGPT: three-step RLHF (SFT, reward model, PPO), after which a 1.3B-parameter model was preferred by humans over 175B GPT-3. That same year Anthropic introduced Constitutional AI and RLAIF: instead of relying on thousands of human ratings, the model critiques and rewrites its own answers against a set of principles (a constitution), with an AI model acting as judge.

In May 2023 Rafailov et al. (Stanford) released DPO (Direct Preference Optimization, arXiv 2305.18290, later NeurIPS 2023). They showed mathematically that the optimal policy of the KL-regularized RLHF objective has a closed form, so the reward can be rewritten in terms of the policy itself. That removes the separate reward model and PPO, and replaces the RL stage with a simple classification-style loss on preference pairs. DPO proved simpler and more stable to train than PPO, and by 2024-2025 it had become a standard choice for open-source models, including the Llama 3 post-training pipeline and many fine-tunes of Mistral.

Pretraining vs Alignment: Two Completely Different Processes

GPT-3 in 2020 knew one thing: predict the next token. Ask it to write a letter and it continued in the style of the request instead of writing the letter. Ask it to solve a problem and it produced text resembling similar problems from the internet. This was not the model being stupid: it was precisely executing its training objective. **Pretraining optimizes for statistical closeness to the training corpus, not for usefulness.**
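A minimal sketch of that objective, using made-up probabilities: the pretraining loss only measures how likely a continuation is under the model, so a corpus-like continuation scores better than a rarer, more helpful one. The function name and all numbers are illustrative assumptions, not measurements from any model.

```python
import math

def next_token_loss(probs_of_observed_tokens):
    """Average negative log-likelihood of the tokens in a continuation."""
    total = sum(math.log(p) for p in probs_of_observed_tokens)
    return -total / len(probs_of_observed_tokens)

# Hypothetical per-token probabilities a pretrained model assigns to two
# candidate continuations of "Write a letter to my landlord".
# More request-style text is common in web data; an actual letter is rarer.
corpus_like_continuation = [0.60, 0.50, 0.70]
helpful_continuation = [0.05, 0.10, 0.08]

loss_corpus_like = next_token_loss(corpus_like_continuation)
loss_helpful = next_token_loss(helpful_continuation)

# Lower loss means "more like the corpus"; usefulness never enters the formula.
print(f"corpus-like: {loss_corpus_like:.3f}, helpful: {loss_helpful:.3f}")
```

Nothing in `next_token_loss` refers to what the user wanted, which is the gap alignment is meant to close.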

Alignment is a separate process on top of pretraining. Its goal: redirect the model from 'predict text' to 'be a helpful assistant'. This is a fundamental shift in the objective function, not just additional training.

The gap between GPT-3 and InstructGPT - despite InstructGPT being **smaller** in parameters - became the standard evidence that, for assistant behavior, alignment can matter more than size. Human evaluators preferred outputs from the 1.3B-parameter InstructGPT over those from 175B GPT-3; the 85% preference rate often quoted from the paper refers to 175B InstructGPT versus 175B GPT-3 (Ouyang et al., 2022).

**Terminology:** 'alignment' literally means aligning model goals with human goals. An unaligned model is a powerful tool with no direction. An aligned one is the same tool trained to be helpful in a specific way.

Why did GPT-3 fail to follow instructions despite being a powerful model?

RLHF: InstructGPT and Teaching Models to Follow Instructions

Ouyang et al. (OpenAI, 2022) described the three-step process that turned GPT-3 into InstructGPT and later underpinned ChatGPT. It is called **RLHF**: Reinforcement Learning from Human Feedback. Each step solves a specific problem left by the previous one.

The key RLHF insight: **the reward model is a compression of human preferences**. Labelers cannot score every response the policy produces during RL training, but they can compare a few tens of thousands of response pairs. The RM generalizes those comparisons into a learned scalar reward that PPO can query as often as it needs. This is why roughly 13K demonstrations and 33K comparisons were enough to align a 175B model.

| Step | Data | Goal | Output |
|---|---|---|---|
| SFT | ~13K demonstrations | Assistant response format | SFT model |
| RM Training | ~33K comparisons | Predict human preference | Reward Model |
| PPO/RL | RM feedback | Maximize reward | InstructGPT |
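The reward model step can be sketched with the commonly used pairwise (Bradley-Terry style) loss, which pushes the score of the preferred response above the score of the rejected one. This is a minimal illustration on plain floats; the scores are hypothetical, and a real reward model would produce them with a neural network and train with an autodiff framework.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected)."""
    return -log_sigmoid(reward_chosen - reward_rejected)

# Hypothetical scalar scores for one human comparison.
loss_correct_order = pairwise_rm_loss(reward_chosen=2.0, reward_rejected=-1.0)
loss_undecided = pairwise_rm_loss(reward_chosen=0.0, reward_rejected=0.0)
loss_wrong_order = pairwise_rm_loss(reward_chosen=-1.0, reward_rejected=2.0)

print(loss_correct_order, loss_undecided, loss_wrong_order)
```

The loss depends only on the difference between the two scores, so the reward model learns a ranking rather than an absolute scale.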

**The RLHF problem:** it is expensive and unstable. It requires thousands of human comparisons. PPO is a complex algorithm that is prone to unstable training. The KL penalty that keeps the policy close to the SFT model needs careful tuning. This is exactly what motivated the search for alternatives that led to DPO.
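To see why the KL term is sensitive, here is a minimal sketch of a KL-shaped reward of the kind used in the RL step: the reward model score minus a penalty proportional to how far the policy has drifted from the reference (SFT) model on that response. The function name, the `beta` values, and the log-probabilities are assumptions chosen for illustration.

```python
def kl_shaped_reward(rm_score, policy_logprob, reference_logprob, beta):
    """Reward model score minus beta times the policy/reference log-ratio."""
    log_ratio = policy_logprob - reference_logprob
    return rm_score - beta * log_ratio

# Hypothetical response that the reward model likes (score 2.0) but that the
# policy now finds far more likely than the reference model does.
rm_score = 2.0
policy_logprob = -10.0
reference_logprob = -30.0

weak_penalty = kl_shaped_reward(rm_score, policy_logprob, reference_logprob, beta=0.02)
strong_penalty = kl_shaped_reward(rm_score, policy_logprob, reference_logprob, beta=0.2)

print(weak_penalty, strong_penalty)
```

With the small `beta` the response is still rewarded; with the larger one the same response is penalized. Too small a value lets the policy exploit weaknesses in the reward model, while too large a value keeps it from improving on the SFT model.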

Why does RLHF need a separate Reward Model instead of direct human feedback?

DPO: RLHF Without a Reward Model - 10x Simpler

Rafailov et al. (Stanford, 2023) showed that the reward model in RLHF is an intermediate artifact that can be eliminated. **DPO (Direct Preference Optimization)** solves the same problem with a binary cross-entropy loss applied directly to (chosen, rejected) response pairs. No RL. No PPO. No separate reward model; the only extra component is a frozen reference copy of the starting model.

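Mathematical intuition: RLHF frames the problem as RL against a reward model, with a KL penalty that keeps the policy close to a reference model. DPO uses the fact that the optimal policy for this objective has a closed form, which can be inverted to express the reward as a scaled log-ratio of policy and reference probabilities (plus a term that depends only on the prompt). Substituting that expression into the preference loss removes the reward model from the loop.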
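The steps of that argument can be written out as follows. The notation is an assumption of this sketch: π_ref is the frozen reference model, β the KL coefficient, σ the logistic function, and (y_w, y_l) the chosen and rejected responses for prompt x.

```latex
% 1. KL-regularized RLHF objective
\[
\max_{\pi}\; \mathbb{E}_{x,\; y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
\;-\; \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
\]

% 2. Its optimal policy has a closed form (Z(x) normalizes over responses)
\[
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)
\]

% 3. Inverting it expresses the reward through the policy
\[
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\]

% 4. In the Bradley-Terry preference model only reward differences matter,
%    so the prompt-only term \beta \log Z(x) cancels
\[
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
\]

% 5. Substituting step 3 into step 4 gives the DPO loss for a trainable policy \pi_\theta
\[
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]
\]
```

Step 4 is what makes the method practical: the normalizer Z(x) is intractable to compute, but it never has to be, because it appears in both rewards and cancels in the difference.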

| Property | RLHF | DPO |
|---|---|---|
| Reward model | Required (separate model) | Not needed |
| RL algorithm | PPO (complex) | None (cross-entropy) |
| Training stability | Unstable | Stable |
| Data pipeline | Comparisons -> RM -> PPO | (chosen, rejected) pairs directly |
| Implementation complexity | High | Low |
| Quality (in practice) | Comparable | Comparable |
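The "None (cross-entropy)" row can be made concrete with a minimal sketch of the DPO loss for a single preference pair. Inputs are sequence log-probabilities of the chosen and rejected responses under the trainable policy and under the frozen reference model. The numbers and the default `beta=0.1` are illustrative assumptions; real training computes these log-probabilities with the models and backpropagates through the policy terms.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair."""
    chosen_log_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_log_ratio = policy_rejected_logp - ref_rejected_logp
    return -log_sigmoid(beta * (chosen_log_ratio - rejected_log_ratio))

# At the start of training the policy equals the reference model.
loss_at_start = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After some updates the policy makes the chosen response more likely
# and the rejected response less likely than the reference does.
loss_after_update = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(loss_at_start, loss_after_update)
```

The loss compares each response's log-probability against the reference, so the policy is rewarded for moving relative to its starting point rather than for raw likelihood.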

By 2024-2025 DPO had become a standard choice. Llama 3 used it in post-training, and many open-source models, including fine-tunes of Mistral, use DPO or related methods (IPO, KTO, ORPO). RLHF with PPO remained mostly in labs with the resources to manage its unstable training, such as OpenAI and Anthropic in early versions.

**In practice:** DPO requires a preference dataset of (prompt, chosen_response, rejected_response) triples. Public datasets exist: Anthropic HH-RLHF, OpenAssistant, UltraFeedback. The Hugging Face TRL library provides a DPOTrainer out of the box.
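A minimal sketch of what such triples look like and of a basic sanity check before training. The field names `prompt`, `chosen`, and `rejected` follow the layout commonly used for preference datasets; the records and the `find_problems` helper are hypothetical, and the exact format a given trainer version expects should be checked against its documentation.

```python
REQUIRED_FIELDS = ("prompt", "chosen", "rejected")

def find_problems(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems

# Hypothetical records for illustration.
dataset = [
    {"prompt": "Summarize RLHF in one sentence.",
     "chosen": "RLHF fine-tunes a model using a reward learned from human comparisons.",
     "rejected": "RLHF is a thing people use."},
    {"prompt": "What is DPO?", "chosen": "A preference method.", "rejected": "A preference method."},
    {"prompt": "Explain PPO.", "chosen": "", "rejected": "No."},
]

clean_records = [record for record in dataset if not find_problems(record)]
print(f"{len(clean_records)} of {len(dataset)} records are usable")
```

Pairs with identical responses carry no preference signal, so filtering them out before training is a cheap check.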

What is the main engineering advantage of DPO over RLHF when adding alignment to a model?

Constitutional AI and RLAIF: Models That Align Themselves

Anthropic (2022) published Constitutional AI: an approach where, instead of relying on human feedback labels for harmlessness, the model uses a **constitution** - a set of written principles - to critique and revise its own responses. RLAIF (RL from AI Feedback) generalizes the idea: an AI model acts as the judge in place of human annotators.

Why this matters for engineers: Constitutional AI makes **evaluation criteria explicit and auditable**. Standard RLHF is a black box - unclear exactly what human labelers evaluate. A constitution is like code: readable, modifiable, version-controlled, auditable for regulatory compliance.
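The critique-and-revise idea can be sketched as a short loop. This is an illustration under stated assumptions, not Anthropic's implementation: the principles below were written for this example and are not the actual constitution, `generate` stands for any function that sends a prompt to a language model, and `toy_model` is a stub that makes the sketch runnable without one.

```python
from typing import Callable, List

# Illustrative principles written for this example only.
CONSTITUTION: List[str] = [
    "Prefer the response that avoids helping with harmful activities.",
    "Prefer the response that is honest about what it does not know.",
]

def critique_and_revise(prompt: str,
                        generate: Callable[[str], str],
                        principles: List[str]) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}\n"
            "Point out where the response violates the principle."
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response so it addresses the critique."
        )
    return response

def toy_model(text: str) -> str:
    """Stand-in for an LLM call: returns a tag instead of real text."""
    return f"<output for a {len(text)}-character input>"

revised = critique_and_revise("How do I pick a lock?", toy_model, CONSTITUTION)
print(revised)
```

Because the principles live in a plain list, changing the model's target behavior is a reviewable diff to `CONSTITUTION`, which is the auditability point made above. In the published method, the revised responses are then used as fine-tuning data.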

| Method | Feedback source | Scale | Transparency |
|---|---|---|---|
| RLHF | Human annotators | Expensive, slow | Low (implicit preferences) |
| DPO | Human comparisons | Cheaper than RM+PPO | Medium (pairs visible) |
| Constitutional AI | Constitution + AI critique | Scales without humans | High (explicit principles) |
| RLAIF | AI model as judge | Cheap and fast | Depends on judge model |
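The RLAIF row can be sketched as a function that turns an AI judge's verdict into a preference record in the same (prompt, chosen, rejected) shape used for DPO or reward model training. The `judge` callable and the keyword-overlap `toy_judge` are hypothetical stand-ins; a real judge would be a language model prompted with the comparison.

```python
from typing import Callable, Dict

def label_with_ai_judge(prompt: str, response_a: str, response_b: str,
                        judge: Callable[[str, str, str], str]) -> Dict[str, str]:
    """Ask the judge which response is better and build a preference record."""
    verdict = judge(prompt, response_a, response_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"judge returned an unexpected verdict: {verdict!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def toy_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Stand-in for an LLM judge: prefers the response sharing more words with the prompt."""
    keywords = set(prompt.lower().split())
    overlap_a = len(keywords & set(response_a.lower().split()))
    overlap_b = len(keywords & set(response_b.lower().split()))
    return "A" if overlap_a >= overlap_b else "B"

pair = label_with_ai_judge(
    "Explain what DPO is",
    "No idea.",
    "DPO is a preference optimization method.",
    toy_judge,
)
print(pair["chosen"])
```

The quality of the resulting dataset is bounded by the judge, which is why the table lists transparency as "Depends on judge model".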

**Claude as an example:** Claude 2 and later versions were trained using Constitutional AI. This means Claude's behavior is partly determined by an explicit set of principles - as opposed to models where alignment is entirely encoded in human preferences.

What is the main engineering advantage of Constitutional AI over standard RLHF?

Myth: Alignment is just additional training on good data

Reality: Alignment changes the objective function: from token prediction to maximizing human preferences. This is not cosmetic; it is a different optimization problem.

A pretrained GPT already has most of the knowledge it needs. The problem is not knowledge; it is that the model is not optimized to be helpful. RLHF and DPO change the training objective itself, using a reward or preference signal.

Myth: A bigger model is a better-aligned model

Reality: InstructGPT 1.3B outperformed GPT-3 175B on instruction following. For assistant tasks, alignment gave a bigger gain than a 100x increase in parameters.

Model size mainly determines capability (knowledge, reasoning). Alignment mainly determines behavior (instruction following, harmlessness). These are largely separate axes. Claude 3 Haiku is better aligned than many large unaligned open-source models.

Key Takeaways

  • Pretraining = token prediction. Alignment = learning to be helpful. Different problems with different data and methods
  • RLHF (2022): three steps - SFT -> Reward Model -> PPO. The reward model compresses human preferences into a learned scalar reward
  • DPO (2023): eliminates the reward model and PPO, optimizes directly on (chosen, rejected) pairs against a frozen reference model. A standard choice for open-source models in 2024-2025
  • Constitutional AI: principles -> AI-critique -> self-revision. Alignment criteria become explicit, auditable, version-controlled
  • RLAIF: AI model as judge instead of humans. Far cheaper to scale than human labeling

Questions for Reflection

  • If you were adding alignment to a custom model for a specific product, which method would you choose and why: RLHF, DPO, or Constitutional AI?
  • How is a 'constitution' as a set of principles better or worse than implicit human preferences from RLHF for a production system?
  • How does alignment affect what a model refuses to do? Is this a bug or a feature - and who should control these boundaries?

Related Topics

Alignment is the foundation for fine-tuning and advanced reasoning. Understanding these methods helps in selecting and configuring models correctly.

  • Fine-tuning: LoRA, QLoRA, PEFT - DPO is implemented on top of fine-tuning infrastructure
  • Reasoning Models - RLVR (RL with Verifiable Rewards) is a related RL method for reasoning that replaces the learned reward model with automatically checkable rewards
  • Guardrails and Safety - Alignment sets the baseline safety; guardrails add application-level control

Related Lessons

  • aie-03-llm-fundamentals — Alignment starts from pretrained LLM internals
  • aie-36-fine-tuning — RLHF and DPO are alignment-stage fine-tuning
  • aie-33-guardrails — Constitutional AI encodes guardrails into training
  • aie-53-future-reasoning — RL on traces trains reasoning models
  • ml-50-policy-gradient — RLHF optimizes a policy via reward gradients