Information Theory

KL-Divergence and Cross-Entropy

1951: Kullback and Leibler measure languages for the NSA. 2018: the same formula is the training loss of GPT. 2023: DPO is derived from the KL penalty that keeps RLHF from collapsing. One idea, three generations of applications. Cross-entropy loss is not an arbitrary choice of loss function: it is what maximum-likelihood training reduces to, and it equals the coding cost that information theory assigns to a model.

**nn.CrossEntropyLoss** in PyTorch: $H(P, Q)$ where P is one-hot target, Q is softmax output. Every classification training step
**VAE**: ELBO = E[log p(x|z)] - $D_{KL}(q(z|x) \| p(z))$. The KL term regularizes the latent space
**RLHF/DPO**: $D_{KL}(\pi_{\theta} \| \pi_{ref})$ penalizes deviation from the reference model during fine-tuning
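**Perplexity**: standard LLM metric, $2^{H(P,Q)}$ with $H$ in bits per token. The GPT-2 family reported roughly 18-37 zero-shot on WikiText-103 depending on model size; the 6-8 range quoted for GPT-4 is an informal estimate, not a published figure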

Prerequisites

Joint and Conditional Entropy

KL-Divergence

1951. Solomon Kullback and Richard Leibler work at the NSA on cryptanalysis. The problem: an intercepted ciphertext. Is it German? Russian? They need a measure of how much the letter-frequency distribution in the text differs from a reference language distribution.

They invented **KL-divergence** - a measure of the 'cost' of using the wrong model. If the true distribution is P and the model is Q, then $D_{KL}(P \| Q)$ is how many extra bits per symbol are spent when encoding data from P using a code optimized for Q.
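The "extra bits" reading can be checked numerically. The sketch below is a minimal illustration, assuming a made-up 4-symbol source `p_true` and a uniform model `q_model` (both hypothetical, not from the lesson): the optimal code costs $H(P)$ bits per symbol, the mismatched code costs $H(P, Q)$, and the gap is $D_{KL}(P \| Q)$.

```python
import math

def kl_divergence_bits(p, q):
    """D_KL(P || Q) in bits for two discrete distributions given as lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # 0 * log(0 / q) is taken as 0 by convention
        if q_i == 0.0:
            return math.inf  # P puts mass where Q says "impossible"
        total += p_i * math.log2(p_i / q_i)
    return total

# Hypothetical source distribution and a uniform model of it.
p_true = [0.5, 0.25, 0.125, 0.125]
q_model = [0.25, 0.25, 0.25, 0.25]

# Average code length with the optimal code for P (entropy of P).
entropy = -sum(p_i * math.log2(p_i) for p_i in p_true)
# Average code length when the code is built for Q but data comes from P.
cross_entropy = -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p_true, q_model))
# The difference between the two is the KL-divergence.
extra_bits = kl_divergence_bits(p_true, q_model)

print(entropy, cross_entropy, extra_bits)
```

With these numbers the optimal code needs 1.75 bits per symbol, the uniform code needs 2 bits, and the KL-divergence is the 0.25-bit difference.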

Asymmetry is a feature, not a bug. $D_{KL}(P \| Q)$ and $D_{KL}(Q \| P)$ answer different questions. In ML we usually minimize $D_{KL}(P_{data} \| P_{model})$: punish the model for failing to cover real data. If $P_{data}$ has a mode the model considers impossible, the penalty is infinite.
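Both effects, the asymmetry and the infinite penalty, show up on two-outcome distributions. This is a small sketch with hypothetical numbers; the helper `kl_bits` is written here for illustration and is not a library function.

```python
import math

def kl_bits(p, q):
    """D_KL(P || Q) in bits; returns inf if Q assigns 0 where P does not."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue  # terms with p(x) = 0 contribute nothing
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total

# Asymmetry: swapping the arguments changes the value.
p = [0.9, 0.1]
q = [0.5, 0.5]
forward = kl_bits(p, q)
backward = kl_bits(q, p)

# Infinite penalty: the data has two modes, the model rules one of them out.
p_data = [0.5, 0.5]
q_collapsed = [1.0, 0.0]
missed_mode = kl_bits(p_data, q_collapsed)   # model fails to cover real data
other_way = kl_bits(q_collapsed, p_data)     # finite: the model stays inside the data's support

print(forward, backward, missed_mode, other_way)
```

The last two values are the ones to compare: the direction used for maximum-likelihood training is infinite for the collapsed model, while the flipped direction reports only 1 bit.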

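**Label smoothing** in classifier training addresses a closely related problem. With one-hot targets the loss $-\log Q(\text{correct class})$ grows without bound as Q(correct class) → 0, and it reaches its minimum of zero only at Q(correct class) = 1, which a softmax can approach only with unbounded logits. Training therefore keeps pushing the model toward overconfidence. Fix: replace P = [0, 0, 1, 0] with P = [0.05, 0.05, 0.85, 0.05]. The minimum now sits at Q = P, which finite logits can reach, and every class keeps q(x) > 0.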
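A minimal sketch of the effect, in plain Python rather than PyTorch. The helpers `smooth_labels` and `soft_cross_entropy` are hypothetical names; the smoothing rule used, mixing the one-hot vector with a uniform distribution, is the common formulation and matches the lesson's example when `eps = 0.2`. In PyTorch the same idea is exposed through the `label_smoothing` argument of `nn.CrossEntropyLoss`.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_norm for z in logits]

def soft_cross_entropy(target_probs, logits):
    """H(P, Q) in nats, with P a full target distribution and Q = softmax(logits)."""
    return -sum(t * lq for t, lq in zip(target_probs, log_softmax(logits)))

def smooth_labels(one_hot, eps):
    """Mix the one-hot target with a uniform distribution over all classes."""
    k = len(one_hot)
    return [(1.0 - eps) * t + eps / k for t in one_hot]

one_hot = [0.0, 0.0, 1.0, 0.0]
smoothed = smooth_labels(one_hot, eps=0.2)  # approximately [0.05, 0.05, 0.85, 0.05]

# Raise the logit of the correct class step by step (other logits stay at 0).
gaps = [2.0, math.log(17.0), 10.0, 20.0]
hard_losses = [soft_cross_entropy(one_hot, [0.0, 0.0, g, 0.0]) for g in gaps]
soft_losses = [soft_cross_entropy(smoothed, [0.0, 0.0, g, 0.0]) for g in gaps]

print(hard_losses)  # keeps shrinking: a larger logit is always rewarded
print(soft_losses)  # smallest at gap = log(17), where softmax equals the smoothed target
```

The gap `log(17)` is where the softmax output equals [0.05, 0.05, 0.85, 0.05]; pushing the logit further makes the smoothed loss worse, which is what removes the incentive for unbounded confidence.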

D_KL(P || Q) = 0.5 bits, D_KL(Q || P) = 0.8 bits. Which statement is true?

MI as KL-Divergence

In the previous lesson, mutual information I(X;Y) looked like just another formula. Now it can be stated precisely: MI is the KL-divergence between the real joint distribution and the 'world of independence', $I(X;Y) = D_{KL}(P(X,Y) \| P(X)P(Y)) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$.

This explains why I(X;Y) ≥ 0 always: it follows from Gibbs' inequality for KL. And why MI is symmetric (I(X;Y) = I(Y;X)): the formula D_KL(P(X,Y) || P(X)P(Y)) is symmetric in X and Y - unlike D_KL(P||Q) where the order of P and Q matters.
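The three properties above (zero under independence, non-negativity, symmetry) can be verified on small joint tables. This is an illustrative sketch: `mutual_information_bits` is a hypothetical helper and the joint tables are made-up examples.

```python
import math

def mutual_information_bits(joint):
    """I(X;Y) = D_KL(P(X,Y) || P(X)P(Y)) in bits; joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]        # marginal of X (row sums)
    py = [sum(col) for col in zip(*joint)]  # marginal of Y (column sums)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p_xy in enumerate(row):
            if p_xy > 0.0:  # zero-probability cells contribute nothing
                mi += p_xy * math.log2(p_xy / (px[i] * py[j]))
    return mi

independent = [[0.25, 0.25], [0.25, 0.25]]  # joint equals product of marginals
copy = [[0.5, 0.0], [0.0, 0.5]]             # Y is an exact copy of X
noisy = [[0.3, 0.2], [0.1, 0.4]]            # partial dependence

mi_independent = mutual_information_bits(independent)
mi_copy = mutual_information_bits(copy)
mi_noisy = mutual_information_bits(noisy)
# Swapping the roles of X and Y means transposing the table.
mi_noisy_swapped = mutual_information_bits([list(col) for col in zip(*noisy)])

print(mi_independent, mi_copy, mi_noisy, mi_noisy_swapped)
```

Independent variables give 0 bits, a perfect copy of a fair bit gives 1 bit, and transposing the table leaves the value unchanged.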

**Information Bottleneck** (Tishby, Pereira and Bialek, 1999; applied to deep networks by Tishby and co-authors in 2015-2017): a neural network as channel X → Z → Y. Good training: maximize I(Z; Y) (the representation is informative about the label) and minimize I(Z; X) (the representation compresses the input). On this view, feature learning is a trade-off between two MI terms.

What does I(X;Y) = D_KL(P(X,Y) || P(X)P(Y)) measure?

Cross-Entropy and the PyTorch Loss

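Every call to `loss = nn.CrossEntropyLoss()(logits, targets)` computes one formula: the batch average of $-\log Q(\text{target})$, where Q is the softmax of the logits and the logarithm is natural, so the result is in nats. It is not a heuristic invented for ML. It is the coding cost of data P under model Q, an interpretation that goes back to Shannon (1948) and to Kullback and Leibler (1951).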
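To make the formula concrete without depending on PyTorch, here is a plain-Python sketch of the same computation for class-index targets. It assumes the default behaviour described above (log-softmax, natural log, mean over the batch); `cross_entropy_from_logits` is a hypothetical helper name, and the logits are made-up values.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    shift = max(logits)
    log_norm = shift + math.log(sum(math.exp(z - shift) for z in logits))
    return [z - log_norm for z in logits]

def cross_entropy_from_logits(batch_logits, targets):
    """Mean of -log Q(target) over the batch, in nats."""
    losses = [-log_softmax(logits)[t] for logits, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Two hypothetical examples with 4 classes each.
batch_logits = [
    [0.0, 0.0, 0.0, 0.0],  # model has no preference: Q is uniform
    [4.0, 0.0, 0.0, 0.0],  # model strongly favours class 0
]
targets = [2, 0]

uniform_loss = cross_entropy_from_logits(batch_logits[:1], targets[:1])
batch_loss = cross_entropy_from_logits(batch_logits, targets)

print(uniform_loss)  # equals log(4): the cost of a uniform guess over 4 classes
print(batch_loss)    # lower, because the second example is predicted confidently and correctly
```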

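**Perplexity** = $2^{H(P,Q)}$ - exponentiated cross-entropy, with $H$ in bits per token (equivalently $e^{H}$ when the loss is in nats, as in PyTorch). Perplexity 10 means the model is, on average, as uncertain as if it were choosing among 10 equally likely options at each step. GPT-2 small reported a zero-shot perplexity of roughly 37 on WikiText-103, and the largest GPT-2 roughly 18. The 6-8 range quoted for GPT-4 is an informal estimate rather than a published figure. Because perplexity is exponential in cross-entropy, a factor of 4 in perplexity corresponds to exactly 2 bits per token, and comparisons are only meaningful on the same dataset and tokenizer.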
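The conversion between loss and perplexity is a one-liner in either unit. A small sketch, assuming a hypothetical sequence in which the model assigns probability 0.1 to every correct token:

```python
import math

def perplexity_from_nats(cross_entropy_nats):
    """Perplexity when the loss uses natural log (the PyTorch convention)."""
    return math.exp(cross_entropy_nats)

def perplexity_from_bits(cross_entropy_bits):
    """Perplexity when the loss is measured in bits per token."""
    return 2.0 ** cross_entropy_bits

# Hypothetical per-token probabilities the model gave to the correct tokens.
token_probs = [0.1] * 5

ce_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
ce_bits = ce_nats / math.log(2)  # change of base: nats -> bits

ppl = perplexity_from_nats(ce_nats)
ppl_check = perplexity_from_bits(ce_bits)

# A 4x gap in perplexity (e.g. 32 vs 8) is a 2-bit gap in cross-entropy.
bits_gap = math.log2(32) - math.log2(8)

print(ppl, ppl_check, bits_gap)
```

Both conversions give a perplexity of 10 up to floating-point rounding, matching the "10 equally likely options" reading.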

| Quantity | Formula | Role in ML |
| --- | --- | --- |
| $H(P)$ | $-\sum_x p(x) \log p(x)$ | Theoretical minimum loss (data-dependent, model-independent) |
| $H(P, Q)$ | $-\sum_x p(x) \log q(x)$ | Cross-entropy loss (what we minimize) |
| $D_{KL}(P \parallel Q)$ | $H(P, Q) - H(P)$ | How much worse than the theoretical minimum |
| Perplexity | $2^{H(P,Q)}$ | LLM metric: 6 = 'choosing among 6 options on average' |
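The identity in the third row of the table follows directly from the definitions. A short derivation, assuming discrete distributions with $q(x) > 0$ wherever $p(x) > 0$:

$$
\begin{aligned}
H(P, Q) &= -\sum_x p(x) \log q(x) \\
&= -\sum_x p(x) \log p(x) + \sum_x p(x) \log \frac{p(x)}{q(x)} \\
&= H(P) + D_{KL}(P \parallel Q)
\end{aligned}
$$

Since $H(P)$ does not depend on $Q$, any change to the model moves $H(P, Q)$ and $D_{KL}(P \parallel Q)$ by the same amount, so both are minimized by the same $Q$.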

Why is minimizing H(P, Q) over Q equivalent to minimizing D_KL(P || Q)?

f-Divergences: GAN, VAE, WGAN

KL is one of infinitely many ways to measure distance between distributions. The whole family is called **f-divergences**: a convex function f(t) defines the rule. The choice of f is the choice of what counts as 'error'.
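In the standard definition, $D_f(P \| Q) = \sum_x q(x)\, f\!\left(\frac{p(x)}{q(x)}\right)$ with $f$ convex and $f(1) = 0$. The sketch below plugs three common choices of $f$ into one generic function. It assumes every $q(x) > 0$ and every $p(x) > 0$ to keep the code short; the helper names and the example distributions are hypothetical.

```python
import math

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x) / q(x)); assumes every q(x) > 0."""
    return sum(q_i * f(p_i / q_i) for p_i, q_i in zip(p, q))

def f_kl(t):
    """f(t) = t log t gives forward KL, D_KL(P || Q)."""
    return t * math.log(t) if t > 0 else 0.0

def f_reverse_kl(t):
    """f(t) = -log t gives reverse KL, D_KL(Q || P); assumes t > 0."""
    return -math.log(t)

def f_total_variation(t):
    """f(t) = |t - 1| / 2 gives total variation distance."""
    return 0.5 * abs(t - 1.0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

kl_pq = f_divergence(p, q, f_kl)
kl_qp = f_divergence(p, q, f_reverse_kl)
tv = f_divergence(p, q, f_total_variation)

# Direct formulas for comparison.
kl_pq_direct = sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q))
kl_qp_direct = sum(q_i * math.log(q_i / p_i) for p_i, q_i in zip(p, q))
tv_direct = 0.5 * sum(abs(p_i - q_i) for p_i, q_i in zip(p, q))

print(kl_pq, kl_qp, tv)
```

Only the function `f` changes between the three divergences; the summation rule stays the same, which is the sense in which the choice of $f$ defines what counts as error.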

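**Forward KL** (mode-covering), $D_{KL}(P_{data} \| Q)$: penalizes missed modes - the model must put mass everywhere the data does. Used in maximum-likelihood training: classifiers, language models, and the likelihood bound a VAE optimizes. **Reverse KL** (mode-seeking), $D_{KL}(Q \| P)$: penalizes hallucinations, i.e. mass placed where the target has almost none - better to be silent than wrong. Used in variational inference, where the approximation Q is the first argument.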
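The two behaviours can be seen on a toy problem. In this sketch the target `p_bimodal` has two modes, and two hypothetical candidate models compete: one spreads mass everywhere, the other commits to a single mode. All numbers are made up for illustration.

```python
import math

def kl_nats(p, q):
    """D_KL(P || Q) in nats; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# Target with two modes (first and last bin).
p_bimodal = [0.45, 0.04, 0.02, 0.04, 0.45]

# Candidate models: cover everything vs. commit to one mode.
q_broad = [0.2, 0.2, 0.2, 0.2, 0.2]
q_peak = [0.9, 0.04, 0.02, 0.02, 0.02]

# Forward KL: expectation under the data, so missing a mode is expensive.
forward_broad = kl_nats(p_bimodal, q_broad)
forward_peak = kl_nats(p_bimodal, q_peak)

# Reverse KL: expectation under the model, so mass in low-density regions is expensive.
reverse_broad = kl_nats(q_broad, p_bimodal)
reverse_peak = kl_nats(q_peak, p_bimodal)

print("forward KL prefers:", "broad" if forward_broad < forward_peak else "peak")
print("reverse KL prefers:", "broad" if reverse_broad < reverse_peak else "peak")
```

Forward KL ranks the broad model higher because the peaked one nearly ignores the second mode; reverse KL ranks the peaked model higher because the broad one spends mass on the low-probability middle bins.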

The original GAN (Goodfellow, 2014) minimizes Jensen-Shannon divergence. WGAN (Arjovsky, 2017) switched to Wasserstein distance - not an f-divergence, but it serves the same purpose. Reason: when generator and data distributions don't overlap (common early in training), KL is infinite and JSD is stuck at its maximum of log 2, so neither gives the generator a useful gradient. Wasserstein distance still shrinks smoothly as the distributions move closer.
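A one-dimensional sketch of the non-overlap problem. Two point masses sit on a unit-spaced grid, `d` cells apart; the grid, the helper names, and the choice of point masses are assumptions made for illustration. On such a grid the Wasserstein-1 distance reduces to the summed absolute difference of the two CDFs.

```python
import math

def point_mass(index, size):
    """Distribution with all probability on one grid cell."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def kl_bits(p, q):
    """D_KL(P || Q) in bits; inf if Q is 0 somewhere P is not."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue
        if q_i == 0.0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total

def jsd_bits(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [0.5 * (p_i + q_i) for p_i, q_i in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

def wasserstein_1d(p, q):
    """Wasserstein-1 on a unit-spaced grid: sum of |CDF_p - CDF_q|."""
    total, cdf_gap = 0.0, 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        total += abs(cdf_gap)
    return total

data = point_mass(0, 10)
results = []
for d in (1, 3, 7):
    generator = point_mass(d, 10)
    results.append((d, kl_bits(data, generator), jsd_bits(data, generator),
                    wasserstein_1d(data, generator)))

for row in results:
    print(row)  # (distance, KL, JSD, Wasserstein)
```

KL is infinite and JSD stays at 1 bit whether the generator is 1 cell or 7 cells away, so neither signals which direction is "closer". The Wasserstein value equals the distance `d` itself.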

| Divergence | Symmetric? | Metric? | Application |
| --- | --- | --- | --- |
| KL (forward) | No | No | CrossEntropyLoss, VAE (ELBO) |
| KL (reverse) | No | No | Variational inference |
| Jensen-Shannon | Yes | Yes (√JSD) | Original GAN |
| Total Variation | Yes | Yes | Theoretical bounds, privacy |
| Wasserstein | Yes | Yes | WGAN, optimal transport |

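**DPO** (Direct Preference Optimization, 2023) is also built on KL. RLHF maximizes reward minus a penalty $\beta \, D_{KL}(\pi_{\theta} \| \pi_{ref})$ for deviating from the reference policy. DPO starts from that same KL-regularized objective and turns it into a simple classification loss on preference pairs, with no separate reward model or RL loop. The KL term prevents the model from collapsing into a high-reward-but-degenerate policy that drifted too far from the pretrained base.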
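For a single preference pair, the DPO loss is $-\log \sigma\big(\beta\,[\log \tfrac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \log \tfrac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}]\big)$, where $y_w$ is the preferred and $y_l$ the rejected response. A minimal sketch in plain Python follows; the function name, the `beta` default, and the log-probability values are hypothetical, and a real implementation would sum token log-probabilities from the two models and use a numerically stable softplus.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities (natural log)."""
    # Log-ratios against the reference policy: how far the policy has moved.
    chosen_shift = logp_chosen - ref_logp_chosen
    rejected_shift = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical sequence log-probabilities under the reference model.
ref_chosen, ref_rejected = -12.0, -11.0

# Policy identical to the reference: margin is 0, loss is log(2).
loss_at_reference = dpo_loss(ref_chosen, ref_rejected, ref_chosen, ref_rejected)

# Policy that raised the chosen response and lowered the rejected one.
loss_after_update = dpo_loss(-10.0, -13.0, ref_chosen, ref_rejected)

print(loss_at_reference, loss_after_update)
```

Only log-ratios against the reference enter the loss, which is how the KL anchor to $\pi_{ref}$ survives even though no explicit KL term is computed.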

Misconception: KL-divergence is a distance metric between distributions

KL is not a metric: it's asymmetric and violates the triangle inequality. It's a measure of informational cost.

A metric d(P,Q) must be symmetric, satisfy the triangle inequality, and equal zero only when P=Q. KL violates the first two. For a true metric use √JSD or Total Variation. For geometric properties use Wasserstein distance.
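Both violations can be exhibited with three coin-flip distributions. This is a sketch with hypothetical parameters (0.5, 0.25, 0.1) chosen so the gap is easy to see; `kl_bernoulli` is a helper written for this example.

```python
import math

def kl_bernoulli(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b)) in nats, for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

p, q, r = 0.5, 0.25, 0.1

# Triangle inequality would require direct <= detour.
direct = kl_bernoulli(p, r)
detour = kl_bernoulli(p, q) + kl_bernoulli(q, r)

# Symmetry would require these two to be equal.
p_to_r = kl_bernoulli(p, r)
r_to_p = kl_bernoulli(r, p)

print(direct, detour)   # the direct "distance" is larger than the detour through q
print(p_to_r, r_to_p)   # the two directions disagree
```

Going from P to R "directly" costs more than going through Q, which no true distance allows.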

Why did WGAN switch from Jensen-Shannon divergence to Wasserstein distance?

Key ideas

**D_KL(P||Q)**: extra bits needed when encoding P using a code for Q. Asymmetric, always ≥ 0
**I(X;Y) = D_KL(P(X,Y) || P(X)P(Y))**: mutual information as KL from the joint to the product of marginals
**H(P,Q) = H(P) + D_KL(P||Q)**: cross-entropy = data entropy + KL. Minimizing CE ≡ minimizing KL
**f-divergences**: KL, JSD, Total Variation - different definitions of 'error' (Wasserstein is a related alternative, not an f-divergence). The choice shapes what the model learns to do

Questions for reflection

Why does VAE use KL(q(z|x) || p(z)) and not the reverse? What changes when the arguments are flipped?
A model has perplexity = 100. What does that mean intuitively? When can perplexity be a misleading metric?
Forward KL (mode-covering) vs Reverse KL (mode-seeking): which is better for text generation? For medical diagnosis?
