Information Theory in Deep Learning

Why does a neural network with millions of parameters generalize at all? Classical bounds (VC dimension, Rademacher complexity) typically give vacuous guarantees for overparameterized models. PAC-Bayes bounds, expressed in terms of a KL divergence between a posterior and a prior over the weights, can be much tighter and in some cases non-vacuous.

Bounding the test error of a fine-tuned LLM without running inference on test data
Certificates for safety-critical systems using neural networks
Understanding why regularization improves generalization

MDL: From Kolmogorov to Neural Networks

Jorma Rissanen introduced the Minimum Description Length (MDL) principle in 1978 as a formalization of Occam's Razor: the best model is the simplest one that explains the data. Grounded in Kolmogorov complexity and algorithmic information theory, MDL offers a coding-theoretic view of model selection.

PAC-Bayes and Generalization Bounds

The PAC-Bayes theorem bounds the expected test error of a randomized predictor drawn from a posterior Q in terms of its empirical error on the training set and the KL divergence KL(Q||P) from a prior P that is fixed before seeing the data.
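To make the statement concrete, here is a minimal sketch of one common form of the bound (a McAllester-style bound for losses in [0, 1]): with probability at least 1 - delta over a training sample of size n, the expected risk is at most the empirical risk plus sqrt((KL(Q||P) + ln(2 sqrt(n) / delta)) / (2n)). The function names, the diagonal-Gaussian choice for Q and P, and the numbers in the usage lines are assumptions made for illustration; they do not come from the lesson.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) in nats for diagonal Gaussians given as per-coordinate lists."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq**2 + (mq - mp) ** 2) / (2 * sp**2) - 0.5
    return total


def mcallester_bound(empirical_risk, kl, n, delta=0.05):
    """Upper bound on the expected risk of a predictor drawn from Q.

    Assumes a loss bounded in [0, 1] and a prior P chosen before seeing the data.
    """
    complexity = (kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n)
    return empirical_risk + math.sqrt(complexity)


# Hypothetical numbers: 60,000 training examples, 3% empirical error, KL of 5,000 nats.
bound = mcallester_bound(empirical_risk=0.03, kl=5000.0, n=60000)
print(f"PAC-Bayes upper bound on test error: {bound:.3f}")
```

With these made-up inputs the bound stays well below 1, which is what "non-vacuous" means for a 0-1 loss. The bound degrades as KL(Q||P) grows relative to n.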

Dziugaite & Roy (2017) computed non-vacuous PAC-Bayes bounds for MNIST networks by optimizing the posterior Q to minimize the bound directly, showing that numerically meaningful PAC-Bayes bounds are achievable in practice, not just in theory.

According to PAC-Bayes, what determines how well a trained model will generalize?

Minimum Description Length

MDL says: choose the model that minimizes the total two-part code length, L(model) + L(data | model), that is, the bits needed to describe the model itself plus the bits needed to describe the data given the model.
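The two-part trade-off can be sketched on a toy problem: encoding a 0/1 sequence either with a parameter-free fair-coin model or with a Bernoulli model whose parameter is stated to a fixed number of bits. The quantization scheme and the function names are illustrative assumptions, not a standard MDL procedure.

```python
import math


def data_bits(sequence, theta):
    """Bits to encode a 0/1 sequence with an ideal code for Bernoulli(theta)."""
    return sum(-math.log2(theta if x == 1 else 1.0 - theta) for x in sequence)


def two_part_code_length(sequence, precision_bits):
    """L(model) + L(data | model) for a Bernoulli model with a quantized parameter.

    precision_bits == 0 stands for the fair-coin model, which costs no model bits.
    """
    if precision_bits == 0:
        return data_bits(sequence, 0.5)
    levels = 2**precision_bits
    # Candidate parameters are the midpoints of `levels` equal-width bins in (0, 1).
    candidates = [(i + 0.5) / levels for i in range(levels)]
    best_data_bits = min(data_bits(sequence, theta) for theta in candidates)
    return precision_bits + best_data_bits


biased = [1] * 90 + [0] * 10
balanced = [1, 0] * 50
for name, seq in [("biased", biased), ("balanced", balanced)]:
    fair = two_part_code_length(seq, 0)
    fitted = two_part_code_length(seq, 3)
    print(f"{name}: fair coin {fair:.1f} bits, 3-bit Bernoulli {fitted:.1f} bits")
```

For the biased sequence the extra model bits pay for themselves through a much shorter data code. For the balanced sequence they do not, so the simpler model wins.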

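A common misconception is that Bayesian inference and MDL are different approaches, one probabilistic and one coding-based.

In fact, Bayesian MAP estimation and two-part MDL select the same model whenever the code lengths are taken from the prior and the likelihood: maximizing p(W) p(D | W) is the same as minimizing -log p(W) - log p(D | W). A prior p(W) is a probability distribution and also a description length code via -log p(W). The choice of prior is the choice of compression scheme.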
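As a worked instance of reading -log p(W) as a description length, the sketch below checks numerically that an isotropic Gaussian prior N(0, sigma^2 I) gives a code length equal to an L2 penalty with coefficient 1 / (2 sigma^2), plus a constant that does not depend on the weights. The function names and the example weights are assumptions chosen for illustration.

```python
import math


def neg_log_gaussian_prior(weights, sigma):
    """-log p(W) in nats for an isotropic Gaussian prior N(0, sigma^2 I)."""
    d = len(weights)
    sq_norm = sum(w * w for w in weights)
    return sq_norm / (2 * sigma**2) + 0.5 * d * math.log(2 * math.pi * sigma**2)


def l2_penalty(weights, lam):
    """Weight-decay style penalty lam * ||W||^2."""
    return lam * sum(w * w for w in weights)


sigma = 0.7
lam = 1.0 / (2 * sigma**2)  # the weight-decay coefficient implied by the prior
w_a = [0.5, -1.0, 2.0]
w_b = [0.1, 0.2, -0.3]

# The constant term cancels, so differences in code length equal differences in penalty.
code_length_gap = neg_log_gaussian_prior(w_a, sigma) - neg_log_gaussian_prior(w_b, sigma)
penalty_gap = l2_penalty(w_a, lam) - l2_penalty(w_b, lam)
print(code_length_gap, penalty_gap)
```

Because the two quantities differ only by a constant, minimizing loss plus weight decay picks the same weights as minimizing the total code length under this prior. A narrower prior (smaller sigma) corresponds to stronger weight decay.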

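Shannon showed that the optimal code for a distribution p assigns length -log2 p(x) bits to symbol x (up to rounding to whole bits). Conversely, by the Kraft inequality, any prefix code with lengths l(x) defines a (sub-)probability distribution q(x) = 2^(-l(x)). So every probability distribution is a code, and every code is implicitly a prior. Bayes and MDL are two languages for the same concept.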
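The correspondence between distributions and codes can be illustrated with a small sketch. It rounds -log2 p(x) up to whole bits to get code lengths, then maps those lengths back to a distribution q(x) = 2^(-length). The example distribution is a hypothetical one chosen so that all probabilities are powers of two.

```python
import math


def shannon_code_lengths(p):
    """Integer code lengths ceil(-log2 p(x)) for every symbol with p(x) > 0."""
    return {x: math.ceil(-math.log2(px)) for x, px in p.items() if px > 0}


def implied_distribution(lengths):
    """Map code lengths back to a (sub-)probability distribution q(x) = 2^-l(x)."""
    return {x: 2.0**-length for x, length in lengths.items()}


p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = shannon_code_lengths(p)
kraft_sum = sum(implied_distribution(lengths).values())
expected_length = sum(p[x] * lengths[x] for x in p)
entropy = -sum(px * math.log2(px) for px in p.values())

print(lengths)
print("Kraft sum:", kraft_sum)
print("expected length:", expected_length, "entropy:", entropy)
```

A Kraft sum of at most 1 is what guarantees that a prefix code with these lengths exists. For this dyadic example the expected code length matches the entropy exactly; for general distributions the rounding costs less than one extra bit per symbol.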

Why does L2 regularization (weight decay) correspond to a Gaussian prior in the MDL framework?

Information Geometry and Natural Gradient

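Information geometry studies the space of probability distributions as a Riemannian manifold, where the metric is given by the Fisher information matrix. This leads to the natural gradient, an optimization method that rescales the ordinary gradient by the inverse Fisher matrix so that step sizes are measured in terms of changes to the model's distribution rather than changes to its raw parameters.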
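A minimal sketch of the idea, assuming a one-dimensional Gaussian model N(mu, sigma^2) fitted by minimizing the average negative log-likelihood. In (mu, sigma) coordinates the Fisher information matrix of this model is diag(1 / sigma^2, 2 / sigma^2), so the natural gradient is the ordinary gradient rescaled coordinate-wise. The function names and data are illustrative assumptions.

```python
def nll_gradient(data, mu, sigma):
    """Euclidean gradient of the average Gaussian negative log-likelihood."""
    n = len(data)
    mean_residual = sum(x - mu for x in data) / n
    mean_squared = sum((x - mu) ** 2 for x in data) / n
    grad_mu = -mean_residual / sigma**2
    grad_sigma = 1.0 / sigma - mean_squared / sigma**3
    return grad_mu, grad_sigma


def natural_gradient(data, mu, sigma):
    """Apply the inverse Fisher matrix diag(sigma^2, sigma^2 / 2) to the gradient."""
    grad_mu, grad_sigma = nll_gradient(data, mu, sigma)
    return grad_mu * sigma**2, grad_sigma * sigma**2 / 2.0


data = [1.0, 2.0, 3.0, 4.0]
for sigma in (0.1, 10.0):
    euclid_mu, _ = nll_gradient(data, 0.0, sigma)
    natural_mu, _ = natural_gradient(data, 0.0, sigma)
    print(f"sigma={sigma}: Euclidean d/dmu={euclid_mu:.4f}, natural d/dmu={natural_mu:.4f}")
```

The Euclidean gradient with respect to mu changes by orders of magnitude as sigma changes, while the natural gradient does not: a unit natural-gradient step moves mu to the sample mean in both cases. That insensitivity to the scale of the parameterization is the property the natural gradient is designed to have.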

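Adam's adaptive learning rates are often interpreted as a rough diagonal approximation to this idea: its running average of squared gradients resembles the diagonal of the empirical Fisher information, although Adam divides by the square root of that estimate rather than by the estimate itself. This loose connection is one proposed explanation for why Adam works well across very different architectures and tasks.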
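The link between Adam and the Fisher diagonal can be sketched as follows. The code compares Adam's bias-corrected second-moment estimate with the mean of squared gradients, which equals the diagonal of the empirical Fisher under the assumption that each gradient is a per-example log-likelihood gradient taken at fixed parameters. The function names and the gradient values are hypothetical.

```python
import math


def adam_second_moment(grads, beta2=0.999):
    """Bias-corrected exponential moving average of squared gradients (Adam's v-hat)."""
    v = [0.0] * len(grads[0])
    steps = 0
    for g in grads:
        steps += 1
        v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, g)]
    return [vi / (1 - beta2**steps) for vi in v]


def empirical_fisher_diag(grads):
    """Per-coordinate mean of squared gradients."""
    n = len(grads)
    return [sum(g[i] ** 2 for g in grads) / n for i in range(len(grads[0]))]


# Hypothetical gradients: coordinate 0 is consistently large, coordinate 1 small.
grads = [[2.0, -0.5], [-2.0, 0.5]] * 50
v_hat = adam_second_moment(grads)
fisher_diag = empirical_fisher_diag(grads)

# Adam scales each coordinate by 1 / sqrt(v_hat), not by 1 / v_hat.
adam_scale = [1.0 / math.sqrt(v) for v in v_hat]
inverse_fisher_scale = [1.0 / f for f in fisher_diag]
print(v_hat, fisher_diag)
print(adam_scale, inverse_fisher_scale)
```

The two second-moment estimates agree here, but the resulting per-coordinate scales differ because of the square root, which is why the analogy with the natural gradient is approximate.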

Why is the natural gradient preferred over the Euclidean gradient for neural network training?

Information Theory in Transformers

Attention mechanisms, layer normalization, and residual connections all have information-theoretic interpretations. Understanding these connections illuminates why transformers work and how to improve them.
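One simple information-theoretic quantity for attention is the entropy of a single row of attention weights: it is high when a query spreads its attention evenly and low when it focuses on one position. The sketch below assumes plain softmax attention over a list of raw scores; the function names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy_bits(scores):
    """Shannon entropy (in bits) of the attention distribution for one query."""
    weights = softmax(scores)
    return -sum(w * math.log2(w) for w in weights if w > 0)


uniform_entropy = attention_entropy_bits([0.0, 0.0, 0.0, 0.0])
peaked_entropy = attention_entropy_bits([20.0, 0.0, 0.0, 0.0])
print(uniform_entropy, peaked_entropy)
```

Uniform attention over four positions gives the maximum of log2(4) = 2 bits, while a sharply peaked score vector gives an entropy close to zero.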

The relationship between model scale and compression efficiency can be read through information theory: the cross-entropy of a language model is the average code length it assigns to text. Larger models have more capacity to represent the data distribution p(text), enabling lower cross-entropy. Empirical scaling laws describe this relationship: over the ranges studied, test loss decreases approximately as a power law in model size.
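The compression view rests on the identity H(p, q) = H(p) + KL(p || q): the cross-entropy of a model q on data from p is the irreducible entropy of the data plus a penalty for the mismatch. The sketch below checks this on toy distributions. The two model distributions are hypothetical stand-ins for a weaker and a stronger model, not outputs of real networks.

```python
import math


def entropy_bits(p):
    """Entropy H(p) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def cross_entropy_bits(p, q):
    """Average code length in bits when data from p is coded with a code built for q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)


def kl_bits(p, q):
    """KL(p || q) in bits: the extra code length paid for using q instead of p."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


p_true = [0.5, 0.25, 0.25]           # toy "data" distribution over three tokens
q_weak = [1 / 3, 1 / 3, 1 / 3]       # hypothetical low-capacity model
q_strong = [0.45, 0.30, 0.25]        # hypothetical higher-capacity model

for name, q in [("weak", q_weak), ("strong", q_strong)]:
    ce = cross_entropy_bits(p_true, q)
    print(f"{name}: cross-entropy {ce:.3f} bits = "
          f"{entropy_bits(p_true):.3f} + {kl_bits(p_true, q):.3f}")
```

Lower cross-entropy therefore means fewer bits per token when the model is used as a compressor, and the entropy of the data is a floor that no model can go below.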

Information Theory in Deep Learning: Key Takeaways

PAC-Bayes bounds the generalization gap via KL(Q||P): posteriors that stay close to the prior get tighter bounds
MDL = MAP inference = regularization: all minimize description length
Fisher information gives the Riemannian geometry of probability distributions
Natural gradient = F⁻¹·∇L; Adam loosely approximates this with a diagonal empirical Fisher estimate, dividing by its square root
Attention entropy, residual connections, and layer normalization all relate to information routing and preservation

Toward a Theory of Deep Learning

Information theory provides perhaps the most coherent partial theory of why deep learning works. But many puzzles remain: why do overparameterized networks generalize, and what makes attention so powerful? These are active research questions.

Questions for Reflection

If Adam is an approximate natural gradient, what property of the loss landscape makes it adaptive per-parameter in a way that helps training?
Residual connections help preserve the mutual information I(x_0; x_L) between the input and the final-layer representation. But is preserving the input information always good? Can you think of cases where it might hurt?
The PAC-Bayes bound improves as KL(Q||P) shrinks. What training procedure would explicitly minimize this KL while training?
