Information Theory
Information theory in machine learning
OpenAI's CLIP aligns images and text with InfoNCE, a contrastive loss that maximizes a lower bound on mutual information. GPT-4, VAEs, and diffusion models can likewise be read as implementations of information-theoretic principles.
- **VAE and beta-VAE**: beta-VAE (DeepMind) reweights the KL term of the VAE, an idea closely related to the IB principle, to encourage disentangled representations, for example separating 'shape' from 'color' in latent space.
- **CLIP** (OpenAI) is trained to maximize an InfoNCE lower bound on I(image; text). This helps explain why CLIP can match images against free-form captions.
- **Neural compression** (Ballé et al., Google, 2020) explicitly optimizes the rate-distortion (R-D) tradeoff with a VAE-style autoencoder plus entropy coding. It beats JPEG at low bitrates.
Prerequisites
Variational inference and ELBO
Variational inference solves a hard problem: how to approximate the intractable posterior p(z | x)? The idea is to pick a 'simple' family of distributions q(z | x) and find the member closest to p(z | x) in KL divergence. Minimizing KL[q || p] is equivalent to maximizing the Evidence Lower Bound (ELBO), and that is where information theory enters: ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)].
**ELBO:** log p(x) = ELBO + KL[q(z | x) || p(z | x)] >= ELBO. ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)]. The first term is reconstruction quality (data likelihood). The second is regularization (proximity to the prior). Equality holds when q = p(z | x): ELBO = log p(x).
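The identity log p(x) = ELBO + KL[q(z | x) || p(z | x)] can be checked numerically on a tiny discrete model. The sketch below assumes a binary latent z and a single observed x. The prior, likelihood, and q values are made-up numbers chosen only for illustration.

```python
import math

def kl(q, p):
    """KL[q || p] in nats for two discrete distributions given as lists."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def elbo(q, prior, lik):
    """ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)] for one observed x."""
    recon = sum(qi * math.log(li) for qi, li in zip(q, lik))
    return recon - kl(q, prior)

prior = [0.5, 0.5]  # p(z), hypothetical
lik = [0.9, 0.2]    # p(x | z) for the one observed x, hypothetical

# Exact quantities are available because z takes only two values.
evidence = sum(p * l for p, l in zip(prior, lik))            # p(x)
posterior = [p * l / evidence for p, l in zip(prior, lik)]   # p(z | x)
log_evidence = math.log(evidence)

q = [0.7, 0.3]                  # an arbitrary approximate posterior q(z | x)
elbo_q = elbo(q, prior, lik)    # strictly below log p(x) for this q
gap = kl(q, posterior)          # KL[q(z | x) || p(z | x)]
elbo_exact = elbo(posterior, prior, lik)  # bound is tight when q = p(z | x)

print(log_evidence, elbo_q + gap, elbo_exact)
```

All three printed numbers should agree up to floating-point error: the gap between ELBO and log p(x) is exactly the KL divergence from q to the true posterior, and it vanishes when q equals that posterior.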
| ELBO component | Information-theoretic meaning | What it optimizes |
|---|---|---|
| E_q[log p(x \| z)] | Lower bound on -H(X \| Z) when averaged over data, the reconstruction part of mutual information | Reconstruction quality |
| -KL[q(z \| x) \|\| p(z)] | Regularization, proximity to the prior | Latent space structure |
| ELBO overall | Lower bound on log p(x) | Data likelihood |
| KL[q(z \| x) \|\| p(z \| x)] | Approximation tightness | How close q is to the truth |
Historical note
Variational inference was a statistical tool long before deep learning. It became mainstream in DL with the VAE (Kingma and Welling, 2013), an early scalable generative model trained explicitly on ELBO. Since then ELBO has been a standard tool for generative models and Bayesian deep learning.
Common misconception: a VAE minimizes reconstruction loss the same way a plain autoencoder does.
In fact, a VAE maximizes ELBO = reconstruction - KL[q(z | x) || p(z)]. The KL regularization fundamentally changes the objective: the goal is not just accurate reconstruction but also a structured latent space.
Without the KL term the VAE tends to degenerate into a near-deterministic autoencoder. The KL term ensures that z sampled from the prior still decodes to meaningful outputs.
The KL term in ELBO, KL[q(z | x) || p(z)], is always >= 0. What happens to training if you drop it (beta = 0 in beta-VAE)?
ELBO as coding
ELBO has a natural Minimum Description Length (MDL) interpretation: -ELBO is an upper bound on -log p(x), the ideal code length of x under the model. The reconstruction term E_q[-log p(x | z)] is the code length of x given z. The KL term is the cost of coding z relative to the prior. Together: -ELBO = (description of z) + (description of x given z), so maximizing ELBO minimizes the total code length.
**Coding interpretation of ELBO:** -log p(x) >= KL[q(z | x) || p(z)] + E_q[-log p(x | z)]. The right-hand side is the code length of z under the prior plus the code length of x given z. A two-part code: first z, then x. MDL principle: a good model is a good compressor.
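As a small worked example of the two-part code, the sketch below adds the KL cost and the reconstruction cost and converts the total from nats to bits. The two models and their numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
import math

def code_length_bits(kl_nats, recon_nats):
    """Two-part code length (-ELBO), converted from nats to bits."""
    return (kl_nats + recon_nats) / math.log(2)

# Hypothetical models: A spends more on the latent code, B more on residuals.
model_a = code_length_bits(kl_nats=20.0, recon_nats=60.0)
model_b = code_length_bits(kl_nats=5.0, recon_nats=90.0)

# The shorter total description corresponds to the higher ELBO.
better = "A" if model_a < model_b else "B"
print(round(model_a, 1), round(model_b, 1), better)
```

Only the sum matters for ELBO: a model may pay more nats for z if that saves more nats when describing x given z.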
| Method | Information principle | Optimization target |
|---|---|---|
| VAE | ELBO = MDL two-part code | log p(x) lower bound |
| beta-VAE | reconstruction - beta * KL | Disentanglement |
| VQ-VAE | Discrete codebook = vector quantization | Discrete representations |
| InfoVAE | ELBO + mutual information term | z-structure quality |
Historical note
The link between ELBO and MDL goes back to Rissanen (1978) and Wallace and Boulton (1968). Its neural-network application was crystallized by Hinton and colleagues in the 'Wake-Sleep' algorithm (1995) and later in the VAE (2013). Understanding ELBO as compression opens the door to neural compression.
Common misconception: a small KL in a VAE is a sign of a good model.
In fact, a small KL can signal posterior collapse (q ≈ prior): z then carries almost no information about x, and reconstruction suffers. The optimum is a balance between KL and reconstruction.
Minimizing only KL is trivial: set q = prior. But then z carries no information about x and reconstruction is impossible.
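The trivial solution q = prior is easy to see in closed form. The sketch below assumes a diagonal Gaussian encoder and a standard normal prior, a common choice that the text above does not fix. The mean and log-variance values are hypothetical.

```python
import math

def gaussian_kl(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in nats."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, log_var)
    )

# Collapsed encoder: q(z | x) equals the prior for every x, so KL is zero
# and z cannot carry any information about x.
collapsed = gaussian_kl(mu=[0.0, 0.0], log_var=[0.0, 0.0])

# Informative encoder: shifted mean and reduced variance cost nats,
# which the model must earn back through better reconstruction.
informative = gaussian_kl(mu=[1.5, -0.5], log_var=[-2.0, -1.0])

print(collapsed, informative)
```

A KL of exactly zero is therefore a warning sign rather than a success: it is reachable without looking at the data at all.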
Compare a VAE with KL = 50 nats and reconstruction loss = 30 nats against a model with KL = 10 nats and reconstruction loss = 100 nats. Which has the better ELBO?
Information Bottleneck
Information Bottleneck (Tishby et al., 2000) is a learning principle: find a representation Z of the input X that is maximally informative about the target Y at minimum complexity (minimum mutual information with X). It is a tradeoff: I(Z; Y) as large as possible, I(Z; X) as small as possible. The Lagrangian is max I(Z; Y) - beta * I(Z; X). Connection to VAE: with reconstruction as the target (Y = X) and a variational bound on I(Z; X), the beta-VAE objective takes the same form as the IB Lagrangian.
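**Information Bottleneck:** L_IB = I(Z; Y) - beta * I(Z; X) -> max. At beta -> 0 compression is not penalized, so Z may keep all information about X. As beta grows, Z is compressed more aggressively, and at beta -> infinity it collapses to a constant. The minimal sufficient statistic for Y is the small-beta limit (beta -> 0+): Z keeps all of I(X; Y) and as little else about X as possible. Markov chain: Y - X - Z (Z is computed from X alone). Data processing bound: I(Z; Y) <= I(X; Y), so Z cannot know more about Y than X does.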
| beta | Preferred Z | DL example |
|---|---|---|
| beta -> 0 | Z = X (no compression) | Autoencoder without regularization |
| intermediate beta | Balance of usefulness and compression | VAE, standard training |
| large beta | Z keeps only the features most informative about Y | Strong IB, feature selection |
| beta -> infinity | Z = const (maximum compression) | Input is ignored |
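
The objective L_IB = I(Z; Y) - beta * I(Z; X) can be evaluated exactly on a toy problem. The sketch below assumes X uniform on {0, 1, 2, 3}, a label Y = X // 2, and three hand-picked deterministic encoders. All of these are hypothetical choices for illustration. In this exact setting the data processing inequality I(Z; Y) <= I(Z; X) makes the constant encoder optimal once beta >= 1. Practical variational objectives bound both terms, so their useful range of beta need not match this toy.

```python
import math

def mutual_information(joint):
    """I(A; B) in nats from a joint probability table joint[a][b]."""
    pa = [sum(row) for row in joint]
    pb = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log(p / (pa[a] * pb[b]))
        for a, row in enumerate(joint)
        for b, p in enumerate(row)
        if p > 0
    )

# Hypothetical toy task: X uniform on {0, 1, 2, 3}, Y = X // 2.
p_x = [0.25] * 4
y_of_x = [0, 0, 1, 1]

def ib_terms(encoder, n_z):
    """Return (I(Z; Y), I(Z; X)) for a deterministic encoder z = encoder[x]."""
    joint_zx = [[0.0] * 4 for _ in range(n_z)]
    joint_zy = [[0.0] * 2 for _ in range(n_z)]
    for x, px in enumerate(p_x):
        z = encoder[x]
        joint_zx[z][x] += px
        joint_zy[z][y_of_x[x]] += px
    return mutual_information(joint_zy), mutual_information(joint_zx)

encoders = {
    "identity": ([0, 1, 2, 3], 4),  # Z = X, no compression
    "label": ([0, 0, 1, 1], 2),     # Z = Y, keeps only what predicts Y
    "constant": ([0, 0, 0, 0], 1),  # Z ignores the input
}

def best_encoder(beta):
    """Name of the candidate encoder with the highest IB objective."""
    scores = {}
    for name, (enc, n_z) in encoders.items():
        i_zy, i_zx = ib_terms(enc, n_z)
        scores[name] = i_zy - beta * i_zx
    return max(scores, key=scores.get)

print(best_encoder(0.5), best_encoder(2.0))
```

For beta between 0 and 1 the "label" encoder wins because it keeps all of I(X; Y) at half the cost of the identity map. For larger beta the penalty outweighs any gain and the constant encoder wins.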
Historical note
Naftali Tishby and co-authors proposed IB in 2000. In 2017 he and Ravid Shwartz-Ziv put forward the controversial 'information plane' hypothesis: neural networks pass through a fitting phase followed by a compression phase. The hypothesis sparked debate: Saxe et al. (2018) argued that the compression effect depends on the activation function.
Common misconception: Information Bottleneck proves that neural networks compress information about X during training.
In fact, Tishby's compression-phase hypothesis is debated, and the observed behavior depends on the activation function and on how I(Z; X) is measured.
Saxe et al. (2018) reported that networks with ReLU activations showed no compression phase in their experiments. The effect appeared with saturating activations like tanh and sigmoid.
In Information Bottleneck, what does the Markov chain Y - X - Z mean?
Mutual information estimation
Mutual information I(X; Y) is hard to compute in continuous high-dimensional spaces. Neural estimators address this: MINE (Mutual Information Neural Estimation), CLUB (Contrastive Log-ratio Upper Bound), and InfoNCE are lower or upper bounds on I(X; Y) parameterized by neural networks. The key application is contrastive learning (SimCLR, CLIP), which can be read as maximizing a lower bound on I(X; Z).
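**MINE:** I(X; Y) >= E_{p(x, y)}[T(x, y)] - log E_{p(x) p(y)}[e^{T(x, y)}], where T is a neural critic network. Maximizing over T tightens the bound toward I(X; Y). **InfoNCE:** I(X; Y) >= log(N) - L_NCE, where L_NCE is the contrastive loss and N is the number of candidates per anchor (one positive plus N - 1 negatives). Because L_NCE >= 0, this bound can never exceed log(N). Used in SimCLR and CLIP for representation learning.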
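The MINE bound can be checked on a discrete toy where the true mutual information is known. The sketch below replaces the neural critic T with a lookup table, which is an assumption made for illustration. The joint distribution and the "rough" critic values are hypothetical.

```python
import math

def marginals(joint):
    """Marginals p(x) and p(y) of a joint table joint[x][y]."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return px, py

def dv_bound(joint, critic):
    """E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp(T)] in nats, T given as a table."""
    px, py = marginals(joint)
    joint_term = sum(
        joint[i][j] * critic[i][j]
        for i in range(len(px)) for j in range(len(py))
    )
    product_term = sum(
        px[i] * py[j] * math.exp(critic[i][j])
        for i in range(len(px)) for j in range(len(py))
    )
    return joint_term - math.log(product_term)

joint = [[0.4, 0.1], [0.1, 0.4]]  # hypothetical p(x, y)
px, py = marginals(joint)

true_mi = sum(
    joint[i][j] * math.log(joint[i][j] / (px[i] * py[j]))
    for i in range(2) for j in range(2)
)

# The log density ratio is the critic that makes the bound tight.
optimal_critic = [
    [math.log(joint[i][j] / (px[i] * py[j])) for j in range(2)]
    for i in range(2)
]
rough_critic = [[1.0, 0.0], [0.0, 1.0]]  # right shape, wrong scale
zero_critic = [[0.0, 0.0], [0.0, 0.0]]   # uninformative critic

print(true_mi, dv_bound(joint, optimal_critic), dv_bound(joint, rough_critic))
```

The optimal critic recovers the true value, the rough critic gives a smaller but still positive estimate, and the zero critic gives zero. MINE trains a network to climb from the latter toward the former.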
| Method | Type | Use case | Cost |
|---|---|---|---|
| MINE | Lower bound | Network diagnostics | O(N^2) |
| InfoNCE | Lower bound | SimCLR, CLIP | O(N^2) |
| CLUB | Upper bound | IB minimization | O(N^2) |
| KSG | Non-parametric | Low-dim cases | O(N log N) |
Historical note
MINE (Belghazi et al., 2018) showed that mutual information can be estimated with a neural network. InfoNCE (van den Oord et al., 2018) became the backbone of contrastive learning. CLIP (OpenAI, 2021) uses InfoNCE to align images and text, a clean product-grade application of information-theoretic ideas in ML.
Common misconception: the InfoNCE loss minimizes I(X; Z) for better invariance.
In fact, minimizing the InfoNCE loss maximizes a lower bound on I(X; Z). The goal is high mutual information between input and representation.
Contrastive learning wants representations to carry maximum information about the source data (positive pairs) and minimum about irrelevant data (negatives).
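A minimal sketch of the InfoNCE loss and the bound log(N) - L_NCE follows. It assumes a precomputed N x N score matrix whose diagonal holds the positive pairs. The score values are hypothetical and stand in for similarities that an encoder would produce.

```python
import math

def info_nce_loss(scores):
    """Mean cross-entropy of picking the positive (diagonal entry) in each row.

    scores[i][j] is the similarity between anchor i and candidate j.
    """
    total = 0.0
    for i, row in enumerate(scores):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_norm - row[i]
    return total / len(scores)

def mi_lower_bound(scores):
    """InfoNCE bound log(N) - L_NCE in nats."""
    return math.log(len(scores)) - info_nce_loss(scores)

n = 4
# All candidates look alike: the loss is log(N) and the bound is zero.
uninformative = [[0.0] * n for _ in range(n)]
# Positives score much higher than negatives: the bound approaches log(N).
aligned = [[5.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

print(mi_lower_bound(uninformative), mi_lower_bound(aligned), math.log(n))
```

The bound saturates at log(N) no matter how good the scores are, which is one reason contrastive methods benefit from large batches of negatives.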
SimCLR learns representations by maximizing I(X; Z). Why use InfoNCE instead of computing I(X; Z) directly?
Takeaways
- **ELBO** = E_q[log p(x | z)] - KL[q(z | x) || p(z)] is a lower bound on log p(x). Maximizing ELBO minimizes a two-part code length (MDL).
- **Information Bottleneck:** max I(Z; Y) - beta * I(Z; X). Tradeoff: Z must be sufficient for Y and minimal relative to X.
- **MINE/InfoNCE:** neural estimators of mutual information. InfoNCE gives the lower bound log(N) - L_NCE <= I(X; Y). Contrastive learning = maximizing that bound.
- **VAE = IB:** under appropriate assumptions VAE solves the IB problem. ELBO is the learning objective; it is an information principle.
Related topics
Information theory in ML links theoretical concepts to practical training methods.
- Rate-Distortion Theory — ELBO is the R-D tradeoff for coding data. Neural codecs are explicit R-D optimization.
- Information theory in deep learning — PAC-Bayes, generalization, information plane: the next layer up.
- Data compression: JPEG, H.265, LLM — Neural compression is VAE plus entropy coding for media.
Questions for reflection
- VAE and contrastive learning are different ways to maximize I(X; Z). What is the principled difference in their strategies?
- If beta -> infinity in beta-VAE, what happens to reconstruction? Why does that correspond to maximum 'compression' in Information Bottleneck?
- CLIP trains on billions of image-text pairs. From an information-theoretic point of view, what is stored in CLIP weights after training?