Information Theory

Shannon Entropy: the Math of Uncertainty

1948. Bell Labs. Claude Shannon publishes 'A Mathematical Theory of Communication' - 77 pages that built the digital age. The central idea: the bit. Every file compressed by ZIP, every token chosen by GPT-4, every gradient step in a neural network traces back to one formula: $H(X) = -\sum p_i \log p_i$.

PyTorch `nn.CrossEntropyLoss` is literally Shannon entropy between the model prediction and the label
JPEG, MP3, gzip, Brotli - engineering approximations to the lower bound $H(X)$ proven in 1948
LLM perplexity = $2^H$ - GPT-2 small ~30, GPT-4 ~6 on good text
Temperature in Stable Diffusion/GPT - a regulator for the entropy of the softmax output distribution
Entropy bonus in RL (SAC, PPO, MaxEnt IRL) - preventing collapse to a single mode

Предварительные знания

Logarithms: $\log_2 8 = 3$, $\log_b(xy) = \log_b x + \log_b y$
Probability distribution: $\sum p_i = 1$, $p_i \geq 0$
Expectation: $\mathbb{E}[f(X)] = \sum p_i f(x_i)$

Surprise and entropy: where the formula comes from

1948. Bell Labs. Claude Shannon publishes 'A Mathematical Theory of Communication' - 77 pages that built the digital age. The central idea: the bit. The central question: how to measure the amount of information in a message? Shannon starts with a single event.

**Surprise (self-information)** of event $x$ is $I(x) = -\log P(x)$. Three requirements pin down the formula uniquely: (1) a rare event carries more information, (2) a certain event carries zero, (3) independent events add: $I(x,y) = I(x) + I(y)$. The third requirement rules out every alternative - the logarithm is the unique function turning multiplication into addition.

**Units depend on the logarithm base:** $\log_2$ gives bits (CS, ML, communications), $\ln$ gives nats (physics, autograd). One nat = $1/\ln 2 \approx 1.443$ bits. `torch.nn.CrossEntropyLoss` computes in nats - same Shannon formula, different base.

Which event carries the most information in bits?

Entropy properties: maximum, minimum, compression

Entropy is the expected surprise over the whole distribution: $H(X) = \mathbb{E}[I(X)] = -\sum_i p_i \log p_i$. By convention $0 \log 0 = 0$. Two key properties follow from the formula: the flatter the distribution, the higher $H$, and $H \geq 0$ always.

**Maximum entropy:** for $n$ outcomes $H(X) \leq \log_2 n$, with equality at the uniform distribution. Proof - Jensen's inequality on the concave $-\log$. This is more than a fact - it is a principle: given only constraints on a system (mean, variance), the most honest distribution is the one maximizing $H$. MaxEnt generates the normal, exponential, and geometric distributions.

**Shannon source coding theorem:** $H(X) \leq L^* < H(X) + 1$, where $L^*$ is the average length of the optimal prefix code (Huffman). Entropy is the floor. Any algorithm claiming to beat it contains a bug. gzip, Brotli, JPEG, MP3 - all engineering around this bound.

Distribution $\{0.5, 0.25, 0.25\}$. What is the entropy in bits?

Cross-entropy, KL divergence, and ML loss

Let $P$ be the true distribution and $Q$ the model's prediction. **Cross-entropy** $H(P, Q) = -\sum_x P(x) \log Q(x)$ is the average surprise of an observer who expected $Q$ but reality is distributed by $P$. When $Q = P$ this collapses to $H(P)$. When $Q$ is far from $P$, cross-entropy rises.

**Decomposition:** $H(P, Q) = H(P) + D_{KL}(P \| Q)$, where $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \geq 0$. This is the heart of machine learning: $H(P)$ is the irreducible part (world noise), $D_{KL}$ is the penalty for an imperfect model. Maximum likelihood = minimize $H(P, Q)$ = close the KL gap to zero.

**`nn.CrossEntropyLoss` in one sentence:** for a one-hot label $H(P, Q) = -\log Q(\text{true class})$ - the whole sum collapses to minus the log probability of the correct class. That is exactly what PyTorch computes internally. **Perplexity = $2^H$** (or $e^H$ in nats) - the standard LLM metric. GPT-2 small $\approx 30$, GPT-4 $\approx 6$ on good text.

A model predicts perfectly: $Q = P$. What is cross-entropy $H(P, Q)$?

Takeaways

$I(x) = -\log P(x)$ - surprise of an event: rare = informative, certain = zero, independent events add
$H(X) = -\sum p_i \log p_i = \mathbb{E}[I(X)]$ - average surprise, the measure of distributional uncertainty
$H(X) \leq \log_2 n$ - maximum at uniform; MaxEnt generates Gaussian, exponential, geometric
$H$ is the floor on compression: $H \leq L^* < H+1$ (Huffman). ZIP cannot compress random noise
$H(P, Q) = H(P) + D_{KL}(P \| Q) \geq H(P)$ - cross-entropy as ML loss; $H(P)$ is the irreducible loss
Perplexity = $2^H$ - LLM metric; temperature API is a knob for $H$ of the softmax output

What comes next

Entropy is the alphabet of information theory. Everything else grows from it.

Joint and conditional entropy — $H(X, Y)$ and $H(Y|X)$ - how much information remains in $Y$ once $X$ is known
KL divergence — $D_{KL}(P \| Q)$ - the heart of VAE, RLHF, distillation
Mutual information — $I(X; Y) = H(X) - H(X|Y)$ - universal measure of dependence

Вопросы для размышления

A file of random bytes and a file of zeros, same size. Which has higher entropy, and why does gzip compress the second by hundreds of times but barely touch the first?
When training a 1000-class classifier on random labels, the loss plateaus near 6.9. Why that exact number? (Hint: $\ln 1000$.)
If the softmax temperature is raised from 1.0 to 2.0 - how does the entropy of the LLM output distribution change? Up or down, and why?
Shannon estimated English entropy at ~1.3 bits/symbol. GPT-4 reaches cross-entropy ~1.5. What would it mean if a model achieved ~0.8 bits/symbol - a violation of Shannon's theorem or not?

Связанные уроки

Surprise and entropy: where the formula comes from

Which event carries the most information in bits?

Entropy properties: maximum, minimum, compression

Distribution $\{0.5, 0.25, 0.25\}$. What is the entropy in bits?

Cross-entropy, KL divergence, and ML loss

A model predicts perfectly: $Q = P$. What is cross-entropy $H(P, Q)$?

Takeaways

$I(x) = -\log P(x)$ - surprise of an event: rare = informative, certain = zero, independent events add

$H(X) = -\sum p_i \log p_i = \mathbb{E}[I(X)]$ - average surprise, the measure of distributional uncertainty

$H(X) \leq \log_2 n$ - maximum at uniform; MaxEnt generates Gaussian, exponential, geometric

$H$ is the floor on compression: $H \leq L^* < H+1$ (Huffman). ZIP cannot compress random noise

$H(P, Q) = H(P) + D_{KL}(P \| Q) \geq H(P)$ - cross-entropy as ML loss; $H(P)$ is the irreducible loss

Perplexity = $2^H$ - LLM metric; temperature API is a knob for $H$ of the softmax output