Information Theory
Shannon Entropy: the Math of Uncertainty
1948. Bell Labs. Claude Shannon publishes 'A Mathematical Theory of Communication' - 77 pages that built the digital age. The central idea: the bit. Every file compressed by ZIP, every token chosen by GPT-4, every gradient step in a neural network traces back to one formula: $H(X) = -\sum p_i \log p_i$.
- PyTorch `nn.CrossEntropyLoss` is literally Shannon entropy between the model prediction and the label
- JPEG, MP3, gzip, Brotli - engineering approximations to the lower bound $H(X)$ proven in 1948
- LLM perplexity = $2^H$ - GPT-2 small ~30, GPT-4 ~6 on good text
- Temperature in Stable Diffusion/GPT - a regulator for the entropy of the softmax output distribution
- Entropy bonus in RL (SAC, PPO, MaxEnt IRL) - preventing collapse to a single mode
Предварительные знания
- Logarithms: $\log_2 8 = 3$, $\log_b(xy) = \log_b x + \log_b y$
- Probability distribution: $\sum p_i = 1$, $p_i \geq 0$
- Expectation: $\mathbb{E}[f(X)] = \sum p_i f(x_i)$
Surprise and entropy: where the formula comes from
1948. Bell Labs. Claude Shannon publishes 'A Mathematical Theory of Communication' - 77 pages that built the digital age. The central idea: the bit. The central question: how to measure the amount of information in a message? Shannon starts with a single event.
**Surprise (self-information)** of event $x$ is $I(x) = -\log P(x)$. Three requirements pin down the formula uniquely: (1) a rare event carries more information, (2) a certain event carries zero, (3) independent events add: $I(x,y) = I(x) + I(y)$. The third requirement rules out every alternative - the logarithm is the unique function turning multiplication into addition.
**Units depend on the logarithm base:** $\log_2$ gives bits (CS, ML, communications), $\ln$ gives nats (physics, autograd). One nat = $1/\ln 2 \approx 1.443$ bits. `torch.nn.CrossEntropyLoss` computes in nats - same Shannon formula, different base.
Which event carries the most information in bits?
Entropy properties: maximum, minimum, compression
Entropy is the expected surprise over the whole distribution: $H(X) = \mathbb{E}[I(X)] = -\sum_i p_i \log p_i$. By convention $0 \log 0 = 0$. Two key properties follow from the formula: the flatter the distribution, the higher $H$, and $H \geq 0$ always.
**Maximum entropy:** for $n$ outcomes $H(X) \leq \log_2 n$, with equality at the uniform distribution. Proof - Jensen's inequality on the concave $-\log$. This is more than a fact - it is a principle: given only constraints on a system (mean, variance), the most honest distribution is the one maximizing $H$. MaxEnt generates the normal, exponential, and geometric distributions.
**Shannon source coding theorem:** $H(X) \leq L^* < H(X) + 1$, where $L^*$ is the average length of the optimal prefix code (Huffman). Entropy is the floor. Any algorithm claiming to beat it contains a bug. gzip, Brotli, JPEG, MP3 - all engineering around this bound.
Distribution $\{0.5, 0.25, 0.25\}$. What is the entropy in bits?
Cross-entropy, KL divergence, and ML loss
Let $P$ be the true distribution and $Q$ the model's prediction. **Cross-entropy** $H(P, Q) = -\sum_x P(x) \log Q(x)$ is the average surprise of an observer who expected $Q$ but reality is distributed by $P$. When $Q = P$ this collapses to $H(P)$. When $Q$ is far from $P$, cross-entropy rises.
**Decomposition:** $H(P, Q) = H(P) + D_{KL}(P \| Q)$, where $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} \geq 0$. This is the heart of machine learning: $H(P)$ is the irreducible part (world noise), $D_{KL}$ is the penalty for an imperfect model. Maximum likelihood = minimize $H(P, Q)$ = close the KL gap to zero.
**`nn.CrossEntropyLoss` in one sentence:** for a one-hot label $H(P, Q) = -\log Q(\text{true class})$ - the whole sum collapses to minus the log probability of the correct class. That is exactly what PyTorch computes internally. **Perplexity = $2^H$** (or $e^H$ in nats) - the standard LLM metric. GPT-2 small $\approx 30$, GPT-4 $\approx 6$ on good text.
A model predicts perfectly: $Q = P$. What is cross-entropy $H(P, Q)$?
Takeaways
- $I(x) = -\log P(x)$ - surprise of an event: rare = informative, certain = zero, independent events add
- $H(X) = -\sum p_i \log p_i = \mathbb{E}[I(X)]$ - average surprise, the measure of distributional uncertainty
- $H(X) \leq \log_2 n$ - maximum at uniform; MaxEnt generates Gaussian, exponential, geometric
- $H$ is the floor on compression: $H \leq L^* < H+1$ (Huffman). ZIP cannot compress random noise
- $H(P, Q) = H(P) + D_{KL}(P \| Q) \geq H(P)$ - cross-entropy as ML loss; $H(P)$ is the irreducible loss
- Perplexity = $2^H$ - LLM metric; temperature API is a knob for $H$ of the softmax output
What comes next
Entropy is the alphabet of information theory. Everything else grows from it.
- Joint and conditional entropy — $H(X, Y)$ and $H(Y|X)$ - how much information remains in $Y$ once $X$ is known
- KL divergence — $D_{KL}(P \| Q)$ - the heart of VAE, RLHF, distillation
- Mutual information — $I(X; Y) = H(X) - H(X|Y)$ - universal measure of dependence
Вопросы для размышления
- A file of random bytes and a file of zeros, same size. Which has higher entropy, and why does gzip compress the second by hundreds of times but barely touch the first?
- When training a 1000-class classifier on random labels, the loss plateaus near 6.9. Why that exact number? (Hint: $\ln 1000$.)
- If the softmax temperature is raised from 1.0 to 2.0 - how does the entropy of the LLM output distribution change? Up or down, and why?
- Shannon estimated English entropy at ~1.3 bits/symbol. GPT-4 reaches cross-entropy ~1.5. What would it mean if a model achieved ~0.8 bits/symbol - a violation of Shannon's theorem or not?