AI Engineering

Alignment: How Models Become Helpful - RLHF, DPO, Constitutional AI

Learning Objectives

Understand the deep gap between a pretrained and an aligned model
Break down RLHF (InstructGPT 2022): SFT, Reward Model, PPO - the purpose of each step
Understand DPO: why it replaced RLHF in most open-source models
Learn Constitutional AI and RLAIF - how Anthropic approached alignment at scale

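November 2022 - ChatGPT launches. GPT-3.5 under the hood - but not a raw pretrained model: it is a sibling of InstructGPT, trained with the same alignment recipe. InstructGPT had already shown what that recipe buys: labelers preferred outputs from the 1.3B-parameter InstructGPT over those from the 175B-parameter GPT-3, and at equal size (175B vs 175B) they preferred InstructGPT about 85% of the time (Ouyang et al., 2022). Not because it was smarter - because it was aligned. In the roughly two and a half years between GPT-3 and ChatGPT, the same thing happened that separates a hammer from a surgical instrument: what matters is not size but precise application. Alignment is that precision.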

InstructGPT (Ouyang et al., 2022) - first public demonstration of RLHF at scale: a 1.3B model outperformed 175B GPT-3 on helpfulness
DPO (Rafailov et al., 2023) - now a default choice in open-source post-training: Llama 3, Mistral-based models such as Zephyr, and many models on Hugging Face use DPO or its variants
Constitutional AI (Anthropic, 2022) - underpins Claude: model behavior is defined by explicit principles, not just human feedback
RLAIF scales alignment: instead of 1,000 human annotators - an AI judge. Labeling cost drops by orders of magnitude (roughly 100x by some estimates), with quality that depends on the judge model

Prerequisites

How an LLM works: pretraining, next-token prediction, autoregressive generation
Working knowledge of fine-tuning: training a model further on extra data
Familiarity with the concept of a loss function and gradient descent

From RL on Human Preferences to DPO: How LLM Personality Was Built

Alignment traces back to 2017: Christiano et al. (OpenAI and DeepMind), in 'Deep Reinforcement Learning from Human Preferences', showed that an agent could be trained on pairwise human comparisons instead of an explicit reward function, by fitting a reward model to those comparisons. Five years later that idea became the backbone of ChatGPT. In 2022 Ouyang et al. (OpenAI) published InstructGPT: three-step RLHF (SFT, reward model, PPO), after which a 1.3B-parameter model was preferred by humans over the 175B GPT-3. That same year Anthropic introduced Constitutional AI and RLAIF: instead of thousands of human ratings, the model critiques and rewrites its own answers against a set of principles (a constitution), with an AI model acting as judge. In May 2023 Rafailov et al. (Stanford) released DPO (Direct Preference Optimization, arXiv 2305.18290, later NeurIPS 2023): they showed that the optimal policy of the KL-constrained RLHF objective has a closed form, so the reward can be expressed through the policy itself. That removes the separate reward model and PPO and replaces the pipeline with a simple classification-style loss on preference pairs. DPO proved simpler and more stable than PPO, and by 2024-2025 it had become a default choice for open-source models such as Llama 3 and Mistral-based fine-tunes.

Pretraining vs Alignment: Two Completely Different Processes

GPT-3 in 2020 knew one thing: predict the next token. Ask it to write a letter and it often continued the request in the same style instead of writing the letter. Ask it to solve a problem and it produced more problems like the ones it had seen on the internet. This was not the model being stupid - it was precisely executing its training objective. **Pretraining optimizes for statistical closeness to the training corpus, not for usefulness.**
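To make the pretraining objective concrete, here is a minimal sketch of the next-token loss on a toy example. The tokens and probabilities are invented for illustration; a real model predicts a distribution over its whole vocabulary at every position.

```python
import math

def next_token_loss(probs_per_step, targets):
    """Average negative log-likelihood of the tokens that actually came next.

    probs_per_step: one dict per position, mapping token -> predicted probability
    targets: the observed next token at each position
    """
    nll = [-math.log(step[target]) for step, target in zip(probs_per_step, targets)]
    return sum(nll) / len(nll)

# Toy predictions for what follows the prompt "Write a letter to my landlord"
# (hypothetical numbers: the model has mostly seen such text continue as prose).
predicted = [
    {"about": 0.6, "Dear": 0.1, ".": 0.3},
    {"the": 0.7, "Landlord": 0.2, "rent": 0.1},
]

corpus_like = next_token_loss(predicted, ["about", "the"])      # continues the request
helpful = next_token_loss(predicted, ["Dear", "Landlord"])      # starts the letter

print(f"loss of corpus-like continuation: {corpus_like:.3f}")
print(f"loss of helpful continuation:     {helpful:.3f}")
```

Under these made-up probabilities the helpful continuation has the higher loss: the objective rewards whatever is most typical of the corpus, whether or not it answers the request.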

Alignment is a separate process on top of pretraining. Its goal: redirect the model from 'predict text' to 'be a helpful assistant'. This is a fundamental shift in the objective function, not just additional training.

The gap between GPT-3 and InstructGPT - despite InstructGPT being **smaller** in parameters - became the clearest evidence that, for assistant behavior, alignment can matter more than size. Labelers preferred outputs from the 1.3B-parameter InstructGPT over those from the 175B GPT-3; at equal size (175B vs 175B), InstructGPT was preferred about 85% of the time (Ouyang et al., 2022).

**Terminology:** 'alignment' literally means aligning model goals with human goals. An unaligned model is a powerful tool with no direction. An aligned one is the same tool trained to be helpful in a specific way.

Why did GPT-3 fail to follow instructions despite being a powerful model?

RLHF: InstructGPT and Teaching Models to Follow Instructions

Ouyang et al. (OpenAI, 2022) described the three-step process that turned GPT-3 into InstructGPT and later underpinned ChatGPT. It is called **RLHF** - Reinforcement Learning from Human Feedback. Each step addresses a limitation of the one before it.

The key RLHF insight: **the reward model is a compression of human preferences**. Labelers cannot score every response the policy produces during RL training, but they can rank responses for a few tens of thousands of prompts. The RM generalizes those rankings into a learned scalar scoring function that PPO can query for any new response. This is why roughly 13K demonstrations plus about 33K ranked prompts were enough to align a 175B model.
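The reward-model step can be sketched with the pairwise logistic (Bradley-Terry) loss, -log sigmoid(r_chosen - r_rejected), which is the form used for reward modeling in InstructGPT. Everything else below is an assumption for illustration: the reward is a tiny linear function over two hand-made features, whereas a real RM is a full language model with a scalar head.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def reward(weights, features):
    """Toy linear reward: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))

def pairwise_loss(weights, chosen, rejected):
    """-log sigmoid(r(chosen) - r(rejected)): low when chosen scores higher."""
    return -math.log(sigmoid(reward(weights, chosen) - reward(weights, rejected)))

def train_reward_model(pairs, dim, lr=0.5, epochs=200):
    weights = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Derivative of -log sigmoid(margin) with respect to margin.
            grad_scale = -(1.0 - sigmoid(margin))
            for i in range(dim):
                weights[i] -= lr * grad_scale * (chosen[i] - rejected[i])
    return weights

# Hypothetical features per response: [follows_instruction, is_rude]
pairs = [
    ([1.0, 0.0], [0.0, 0.0]),  # following the instruction beats ignoring it
    ([1.0, 0.0], [1.0, 1.0]),  # polite beats rude
    ([0.0, 0.0], [0.0, 1.0]),  # even an unhelpful answer beats a rude one
]
weights = train_reward_model(pairs, dim=2)

# A comparison the labelers never made directly.
unseen_good = reward(weights, [1.0, 0.0])
unseen_bad = reward(weights, [0.0, 1.0])
print(f"helpful+polite: {unseen_good:.2f}, unhelpful+rude: {unseen_bad:.2f}")
```

The trained scorer ranks a pair it never saw during training, which is the sense in which a reward model "compresses" a finite set of human comparisons into something the RL step can query indefinitely.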

Step	Data	Goal	Output
SFT	~13K demonstrations	Assistant response format	SFT model
RM Training	~33K prompts with ranked responses	Predict human preference	Reward Model
PPO/RL	~31K prompts, scored by the RM	Maximize reward	InstructGPT

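**The RLHF problem:** expensive and unstable. It requires tens of thousands of human comparisons. PPO is a complex algorithm that is sensitive to hyperparameters and keeps several models in memory at once (policy, reference model, reward model, value model). The KL penalty that keeps the policy close to the SFT model needs careful tuning. This is exactly what motivated the search for alternatives that led to DPO.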
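Why the KL term is delicate can be seen in a small sketch of the reward the RL step actually optimizes: the RM score minus beta times the log-ratio between the policy and the reference (SFT) model. The numbers below are hypothetical, and the sequence-level log-ratio stands in for the per-token penalty used in practice.

```python
def kl_penalized_reward(rm_score, logp_policy, logp_reference, beta):
    """RM score minus a penalty for drifting away from the reference model."""
    drift = logp_policy - logp_reference  # sample-based estimate of the KL term
    return rm_score - beta * drift

# Hypothetical responses: one exploits the reward model and has drifted far
# from the reference model; the other scores lower but stays close to it.
hacked = {"rm_score": 5.0, "logp_policy": -10.0, "logp_reference": -40.0}
modest = {"rm_score": 3.0, "logp_policy": -20.0, "logp_reference": -22.0}

results = {}
for beta in (0.02, 0.2):
    r_hacked = kl_penalized_reward(beta=beta, **hacked)
    r_modest = kl_penalized_reward(beta=beta, **modest)
    results[beta] = (r_hacked, r_modest)
    print(f"beta={beta}: hacked={r_hacked:.2f}, modest={r_modest:.2f}")
```

With the small beta the drifted response still wins; with the larger beta the ordering flips. Too little penalty invites reward hacking, too much keeps the policy from improving at all, and the right value has to be found empirically.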

Why does RLHF need a separate Reward Model instead of direct human feedback?

DPO: RLHF Without a Reward Model - 10x Simpler

Rafailov et al. (Stanford, 2023) showed that the reward model in RLHF is an intermediate artifact that can be eliminated. **DPO (Direct Preference Optimization)** solves the same problem with a binary cross-entropy (logistic) loss applied directly to (chosen, rejected) response pairs. No RL. No PPO. No separate reward model - only a frozen copy of the starting model, kept as a reference.

Mathematical intuition: RLHF frames the problem as RL against a reward model, with a KL penalty that keeps the policy close to a reference model. DPO observes that the optimal policy for this objective has a closed form, and inverting it expresses the reward as a scaled log-ratio between the policy's and the reference model's probabilities for a response. Substituting that expression into the preference loss removes the reward model from the loop.
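The resulting loss is short enough to write out. The sketch below follows the DPO formula from the paper for a single preference pair; the inputs are summed log-probabilities of each full response under the trained policy and the frozen reference model, and the specific numbers are made up.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    Each argument is log p(response | prompt), summed over the response tokens.
    beta plays the role of the KL penalty strength from the RLHF objective.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen        # implicit reward / beta
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))      # -log sigmoid(logits)

# At the start of training the policy equals the reference, so the loss is log 2.
at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After training: chosen became more likely, rejected less likely than under the reference.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(f"loss at init: {at_init:.4f}, after improvement: {improved:.4f}")
```

Note that only log-probabilities appear: no sampling from the policy and no reward model call, which is why DPO trains like ordinary supervised fine-tuning.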

Property	RLHF	DPO
Reward model	Required (separate model)	Not needed
RL algorithm	PPO (complex)	None (cross-entropy)
Training stability	Unstable	Stable
Data pipeline	Comparisons -> RM -> PPO	(chosen, rejected) pairs directly
Implementation complexity	High	Low
Quality (in practice)	Comparable	Comparable

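By 2024-2025 DPO had become the default in open-source post-training. Llama 3, Mistral-based models such as Zephyr, and many other open models use DPO or related preference-optimization methods (IPO, KTO, ORPO). RLHF with PPO remained mostly in labs with the resources to manage its instability, such as OpenAI and, in its early models, Anthropic.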

**In practice:** DPO requires a preference dataset - (prompt, chosen_response, rejected_response) triples. Public datasets exist: Anthropic HH-RLHF, OpenAssistant, UltraFeedback. The Hugging Face TRL library provides DPOTrainer out of the box.
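A preference record is just three strings. The sketch below shows one plausible shape, using the field names `prompt`, `chosen` and `rejected` that TRL's preference format uses; the validation rules and helper names are illustrative assumptions, not part of any library.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")

def validate_preference_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for key in REQUIRED_KEYS:
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    if not problems and record["chosen"] == record["rejected"]:
        problems.append("chosen and rejected are identical")
    return problems

def to_jsonl(records):
    """Serialize valid records as JSON Lines, one preference triple per line."""
    valid = [r for r in records if not validate_preference_record(r)]
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in valid)

records = [
    {
        "prompt": "Summarize: The meeting moved from Tuesday to Thursday.",
        "chosen": "The meeting is now on Thursday instead of Tuesday.",
        "rejected": "Meetings are an important part of office life.",
    },
    {"prompt": "Translate 'hello' to French.", "chosen": "Bonjour", "rejected": "Bonjour"},
]

jsonl = to_jsonl(records)
print(jsonl)
```

Filtering out degenerate pairs before training is cheap insurance: a pair whose two sides are identical carries no preference signal.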

What is the main engineering advantage of DPO over RLHF when adding alignment to a model?

Constitutional AI and RLAIF: Models That Align Themselves

Anthropic (2022) published Constitutional AI: an approach in which, instead of relying on human feedback labels for harmlessness, the model uses a **constitution** - a set of written principles - to critique and revise its own responses. RLAIF (RL from AI Feedback) generalizes the idea: an AI model, rather than humans, acts as the judge that produces preference labels.

Why this matters for engineers: Constitutional AI makes **evaluation criteria explicit and auditable**. Standard RLHF is closer to a black box: it is unclear exactly which criteria human labelers applied. A constitution is like code: readable, modifiable, version-controlled, auditable for regulatory compliance.
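The "constitution is like code" point can be taken literally. Below is a minimal sketch of a critique-and-revise loop in which the principles live in a plain list that can be diffed and reviewed. The principle wording, the prompt templates and `stub_model` are all hypothetical placeholders; this is not Anthropic's constitution or pipeline, and a real system would call an actual LLM.

```python
from typing import Callable, List

# Hypothetical principles, stored as data so they can be versioned and audited.
CONSTITUTION: List[str] = [
    "Do not provide instructions that facilitate clearly harmful acts.",
    "Be honest about uncertainty instead of guessing.",
]

def critique_and_revise(prompt: str, draft: str, principles: List[str],
                        model: Callable[[str], str]) -> str:
    """Ask the model to critique and then rewrite its answer, once per principle."""
    response = draft
    for principle in principles:
        critique = model(
            f"Critique the response against this principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        response = model(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    return response

calls: List[str] = []

def stub_model(instruction: str) -> str:
    """Stand-in for an LLM call: records the request and tags revised text."""
    calls.append(instruction)
    if instruction.startswith("Critique"):
        return "The answer could be more careful."
    return "[revised] " + instruction.rsplit("Response: ", 1)[-1]

revised = critique_and_revise("How do I pick a lock?", "Sure, here is how.",
                              CONSTITUTION, stub_model)
print(revised)
```

The (prompt, revised response) pairs produced this way can then serve as supervised fine-tuning data; changing the model's behavior starts with editing the list of principles, and that edit is reviewable like any other change.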

Method	Feedback source	Scale	Transparency
RLHF	Human annotators	Expensive, slow	Low (implicit preferences)
DPO	Human comparisons	Cheaper than RM+PPO	Medium (pairs visible)
Constitutional AI	Constitution + AI critique	Scales without humans	High (explicit principles)
RLAIF	AI model as judge	Cheap and fast	Depends on judge model
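The "Depends on judge model" cell in the table deserves a concrete illustration. In the sketch below an AI judge turns two candidate responses into a (prompt, chosen, rejected) record; the judge is a deliberately flawed stand-in that simply prefers longer answers, an assumption chosen to show how judge bias flows straight into the training data.

```python
from typing import Callable, Dict

def label_pair(prompt: str, response_a: str, response_b: str,
               judge: Callable[[str, str, str], str]) -> Dict[str, str]:
    """Turn a judge verdict ('A' or 'B') into a preference record."""
    verdict = judge(prompt, response_a, response_b)
    if verdict not in ("A", "B"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    if verdict == "A":
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def length_biased_judge(prompt: str, response_a: str, response_b: str) -> str:
    """Hypothetical stand-in for an LLM judge: always prefers the longer answer."""
    return "A" if len(response_a) >= len(response_b) else "B"

record = label_pair(
    "What is 2 + 2?",
    "4",
    "That is a fascinating question with many possible answers.",
    length_biased_judge,
)
print(record["chosen"])
```

The biased judge labels the evasive answer as preferred. An AI judge is cheap and fast, but whatever systematic preferences it has become the preferences the aligned model learns, so spot-checking judge verdicts against human labels remains worthwhile.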

**Claude as an example:** Claude 2 and later versions were trained using Constitutional AI. This means Claude's behavior is partly determined by an explicit set of principles - as opposed to models where alignment is entirely encoded in human preferences.

What is the main engineering advantage of Constitutional AI over standard RLHF?

Myth: Alignment is just additional training on good data

Reality: Alignment changes the objective function: from token prediction to maximizing human preferences. This is not cosmetic - it is a different optimization problem.

A pretrained GPT already has most of the knowledge it needs. The problem is not knowledge - it is that the model doesn't optimize for 'be helpful'. RLHF and DPO change the training objective itself: through an explicit reward signal in RLHF, and through preference pairs in DPO.

Myth: A bigger model is a better-aligned model

Reality: InstructGPT 1.3B outperformed GPT-3 175B on instruction following. For assistant tasks, alignment can be more effective than scaling.

Model size largely determines capability (knowledge, reasoning). Alignment determines behaviour (instruction following, harmlessness). These are largely independent axes. Claude 3 Haiku is better aligned than many large unaligned open-source models.

Key Takeaways

Pretraining = token prediction. Alignment = learning to be helpful. Different problems with different data and methods
RLHF (2022): three steps - SFT -> Reward Model -> PPO. The reward model compresses human preferences into a learned scoring function that the RL step can query
DPO (2023): eliminates the reward model and PPO, optimizes directly on (chosen/rejected) pairs. Standard for open-source 2024-2026
Constitutional AI: principles -> AI-critique -> self-revision. Alignment criteria become explicit, auditable, version-controlled
RLAIF: AI model as judge instead of humans. Scales orders of magnitude cheaper than human labeling

Questions for Reflection

If you were adding alignment to a custom model for a specific product, which method would you choose and why: RLHF, DPO, or Constitutional AI?
How is a 'constitution' as a set of principles better or worse than implicit human preferences from RLHF for a production system?
How does alignment affect what a model refuses to do? Is this a bug or a feature - and who should control these boundaries?

Related Lessons

aie-03-llm-fundamentals — Alignment starts from pretrained LLM internals
aie-36-fine-tuning — RLHF and DPO are alignment-stage fine-tuning
aie-33-guardrails — Constitutional AI encodes guardrails into training
aie-53-future-reasoning — RL on traces trains reasoning models
ml-50-policy-gradient — RLHF optimizes a policy via reward gradients
