Information Theory

Information theory in machine learning

OpenAI's CLIP aligns images and text with the InfoNCE loss, a direct application of mutual information. GPT-4, VAEs, and diffusion models likewise rest on information-theoretic principles.

**VAE and beta-VAE** (beta-VAE: DeepMind) use an IB-style tradeoff to encourage disentangled representations, for example separating 'shape' from 'color' in latent space.
**CLIP** (OpenAI) is trained with InfoNCE, which maximizes a lower bound on I(image; text). This is what lets CLIP match images against free-form captions.
**Neural compression** (Ballé et al., Google, 2020) explicitly optimizes the rate-distortion (R-D) tradeoff with a VAE-style model plus entropy coding. It beats JPEG at low bitrates.

Prerequisites

KL-Divergence and Cross-Entropy

Variational inference and ELBO

Variational inference solves a hard problem: how to approximate the intractable posterior p(z | x). The idea is to pick a 'simple' family of distributions q(z | x) and find the member closest to p(z | x) in KL divergence. Minimizing KL[q(z | x) || p(z | x)] is equivalent to maximizing the Evidence Lower Bound (ELBO), and that is where information theory enters: ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)].

**ELBO:** log p(x) = ELBO + KL[q(z | x) || p(z | x)] >= ELBO. ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)]. The first term is reconstruction quality (data likelihood). The second is regularization (proximity to the prior). Equality holds when q = p(z | x): ELBO = log p(x).
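The decomposition log p(x) = ELBO + KL[q(z | x) || p(z | x)] can be checked numerically on a model small enough to enumerate. The sketch below assumes a hypothetical three-state latent variable and a single fixed observation; the probability values are made up for illustration only.

```python
import numpy as np

# Hypothetical toy model: latent z in {0, 1, 2}, one fixed observation x.
prior = np.array([0.5, 0.3, 0.2])           # p(z)
likelihood = np.array([0.10, 0.40, 0.70])   # p(x | z) for the observed x

evidence = np.sum(prior * likelihood)       # p(x)
posterior = prior * likelihood / evidence   # exact p(z | x)

def kl(a, b):
    """KL[a || b] in nats for discrete distributions with full support."""
    return float(np.sum(a * (np.log(a) - np.log(b))))

def elbo(q):
    """E_q[log p(x | z)] - KL[q || p(z)] for a discrete q(z | x)."""
    return float(np.sum(q * np.log(likelihood))) - kl(q, prior)

q = np.array([0.2, 0.3, 0.5])   # an arbitrary approximate posterior
gap = kl(q, posterior)          # KL[q(z | x) || p(z | x)]

print(np.log(evidence), elbo(q) + gap)   # the two values agree
print(elbo(posterior))                   # equals log p(x) when q is exact
```

Because the gap is a KL divergence, it is non-negative, which is exactly why the ELBO never exceeds log p(x).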

ELBO component	Information-theoretic meaning	What it optimizes
E_q[log p(x | z)]	-H(X | Z), part of mutual information	Reconstruction quality
-KL[q(z | x) || p(z)]	Regularization, proximity to the prior	Latent space structure
ELBO overall	Lower bound on log p(x)	Data likelihood
KL[q(z | x) || p(z | x)]	Approximation tightness	How close q is to the truth

Historical note

Variational inference was a statistical tool long before deep learning. It became mainstream in DL with the VAE (Kingma and Welling, 2013), one of the first scalable generative methods trained explicitly on ELBO. Since then ELBO has been a standard tool for generative models and Bayesian deep learning.

**Misconception:** VAE minimizes reconstruction loss the same way as a plain autoencoder.

**Correction:** VAE maximizes ELBO = reconstruction - KL[q || prior]. KL regularization fundamentally changes the objective: not just exact reconstruction but also a structured latent space.

**Why:** Without KL the VAE degenerates into a deterministic autoencoder. KL ensures z can be sampled from the prior and still give meaningful reconstructions.

The KL term in ELBO, KL[q(z | x) || p(z)], is always >= 0. What happens to training if you drop it (beta = 0 in beta-VAE)?

ELBO as coding

ELBO has a clean Minimum Description Length (MDL) interpretation: the negative ELBO is an upper bound on -log p(x), the ideal code length of x under the model. The reconstruction term, E_q[-log p(x | z)], is the code length of x given z. The KL term, KL[q(z | x) || p(z)], is the cost of transmitting z relative to the prior. Together: -ELBO = (description of z) + (description of x given z). Maximizing ELBO is minimizing the total code length.

**Coding interpretation of ELBO:** -log p(x) <= KL[q(z | x) || p(z)] + E_q[-log p(x | z)]. The right-hand side is the code length of z relative to the prior plus the code length of x given z. A two-part code: first z, then x. MDL principle: a good model is a good compressor.
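As a rough illustration of the two-part code, the sketch below computes the KL term in closed form for a diagonal Gaussian q(z | x) against a standard normal prior, then adds a reconstruction cost and converts the total to bits. The encoder outputs `mu` and `log_var` and the reconstruction value are hypothetical numbers, and the function names are illustrative rather than taken from any library.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in nats."""
    mu = np.asarray(mu, dtype=float)
    log_var = np.asarray(log_var, dtype=float)
    return 0.5 * float(np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var))

def two_part_code_length(kl_nats, recon_nll_nats):
    """Negative ELBO: cost of z plus cost of x given z, in nats."""
    return kl_nats + recon_nll_nats

# Hypothetical encoder output for one input (2-D latent).
mu = [0.5, -1.0]
log_var = [-1.0, 0.0]
kl_nats = gaussian_kl_to_standard_normal(mu, log_var)

# Assumed value of E_q[-log p(x | z)] for the same input.
recon_nll_nats = 12.0

total_nats = two_part_code_length(kl_nats, recon_nll_nats)
total_bits = total_nats / np.log(2)   # nats -> bits
print(kl_nats, total_nats, total_bits)
```

When q(z | x) equals the prior (`mu = 0`, `log_var = 0`) the first part of the code costs nothing, but then z says nothing about x and the whole burden falls on the reconstruction term.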

Method	Information principle	Optimization target
VAE	ELBO = MDL two-part code	log p(x) lower bound
beta-VAE	beta * KL + reconstruction	Disentanglement
VQ-VAE	Discrete codebook = vector quantization	Discrete representations
InfoVAE	ELBO + mutual information term	z-structure quality

Historical note

The link between ELBO and MDL goes back to Rissanen (1978) and Wallace and Boulton (1968). Its neural-network application was crystallized by Hinton and colleagues in the 'Wake-Sleep' algorithm (1995) and later in the VAE (2013). Understanding ELBO as compression opens the door to neural compression.

**Misconception:** Small KL in a VAE is a sign of a good model.

**Correction:** Small KL can mean posterior collapse (q ~ prior), which destroys reconstruction. The optimum is a balance between KL and reconstruction.

**Why:** Minimizing only KL is trivial: set q = prior. But then z carries no information about x and reconstruction is impossible.

VAE with KL = 50 nats and reconstruction loss = 30 nats versus a model with KL = 10 nats and reconstruction = 100 nats. Which has the better ELBO?

Information Bottleneck

Information Bottleneck (Tishby, 2000) is a learning principle: find a representation Z of the input X that is maximally informative about the target Y at minimum complexity (minimum mutual information with X). It is a tradeoff: I(Z; Y) maximal, I(Z; X) minimal. The Lagrangian is max I(Z; Y) - beta * I(Z; X). Connection to VAE: under appropriate assumptions VAE realizes IB.

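**Information Bottleneck:** maximize L_IB = I(Z; Y) - beta * I(Z; X). At beta -> 0 compression costs nothing, so Z may keep all information about X. As beta grows, Z is pushed to drop whatever in X is not predictive of Y. At beta -> infinity the compression term dominates and Z collapses to a constant. Markov chain: Y - X - Z (Z is defined only through X). Markov bound (data-processing inequality): I(Z; Y) <= I(X; Y). Z cannot know Y better than X does.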
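For small discrete variables both mutual-information terms of the IB objective can be computed exactly from probability tables. The sketch below assumes a hypothetical joint p(x, y) with four input values and a binary target, plus a hand-picked deterministic encoder p(z | x) that merges inputs with identical p(y | x). All numbers and names are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """I(A; B) in nats from a joint probability table p(a, b)."""
    joint = np.asarray(joint, dtype=float)
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0   # skip zero cells, which contribute 0 to the sum
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

# Hypothetical toy problem: X has 4 values, Y is binary.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])   # p(x, y)
p_x = p_xy.sum(axis=1)

# Deterministic encoder p(z | x): merges x in {0, 1} and x in {2, 3}.
p_z_given_x = np.array([[1.0, 0.0],
                        [1.0, 0.0],
                        [0.0, 1.0],
                        [0.0, 1.0]])

p_xz = p_x[:, None] * p_z_given_x   # p(x, z)
p_zy = p_z_given_x.T @ p_xy         # p(z, y), using the chain Y - X - Z

i_xy = mutual_information(p_xy)
i_zx = mutual_information(p_xz)
i_zy = mutual_information(p_zy)

beta = 0.5   # arbitrary tradeoff weight for illustration
objective = i_zy - beta * i_zx
print(i_xy, i_zx, i_zy, objective)
```

In this toy case Z keeps everything X knows about Y (I(Z; Y) equals I(X; Y)) while storing only log 2 nats about X instead of log 4, which is the kind of representation the bottleneck favors.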

beta	Preferred Z	DL example
beta -> 0	Z = X (no compression)	Autoencoder without regularization
beta ~ 1	Balance of usefulness and compression	VAE, standard training
beta >> 1	Z keeps only what is most relevant to Y	Strong IB, feature selection
beta -> infinity	Z = const (maximum compression)	Input is ignored

Historical note

Naftali Tishby and co-authors proposed IB in 2000. In 2017 Shwartz-Ziv and Tishby put forward the controversial 'information plane' hypothesis: that neural networks pass through a fitting phase followed by a compression phase. The hypothesis sparked debate: Saxe et al. (2018) argued that the compression effect is an artifact of the activation function.

**Misconception:** Information Bottleneck proves neural networks compress information about X during training.

**Correction:** Tishby's compression-phase hypothesis is debated and depends on the activation function and on how I(Z; X) is measured.

**Why:** Saxe et al. (2018) showed that with ReLU activations no compression is observed. The effect is specific to saturating activations like tanh and sigmoid.

In Information Bottleneck, what does the Markov chain Y - X - Z mean?

Mutual information estimation

Mutual information I(X; Y) is hard to compute in continuous high-dimensional spaces. Neural estimators address this: MINE (Mutual Information Neural Estimation), CLUB (Contrastive Log-ratio Upper Bound), and InfoNCE are all lower or upper bounds on I(X; Y) implemented as neural networks. The key application is contrastive learning (SimCLR, CLIP), which maximizes a lower bound on I(X; Z).

**MINE:** I(X; Y) >= E_{p(x, y)}[T(x, y)] - log E_{p(x) p(y)}[e^{T(x, y)}], where T is the neural estimator. Maximizing over T approaches I(X; Y). **InfoNCE:** I(X; Y) >= log(N) - L_NCE, where L_NCE is the contrastive loss. Used in SimCLR and CLIP for representation learning.
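The InfoNCE bound can be sketched in plain NumPy. The example below assumes correlated one-dimensional Gaussian pairs, for which the true mutual information is known in closed form, and uses a fixed hand-written critic (the log density ratio up to a constant) in place of a trained network. The batch size, correlation, and function name are illustrative choices.

```python
import numpy as np

def info_nce_loss(scores):
    """Contrastive loss from an N x N score matrix.

    scores[i, j] is the critic value for the pair (x_i, y_j). Diagonal
    entries are the positive pairs; off-diagonal entries act as negatives.
    """
    scores = np.asarray(scores, dtype=float)
    # Row-wise log-softmax, shifted by the row max for numerical stability.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -float(np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
n, rho = 256, 0.9

# Correlated Gaussian pairs; analytic MI is -0.5 * log(1 - rho^2).
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Assumed critic: log p(y | x) - log p(y), dropping additive constants.
scores = (-(y[None, :] - rho * x[:, None])**2 / (2 * (1 - rho**2))
          + y[None, :]**2 / 2)

loss = info_nce_loss(scores)
mi_lower_bound = np.log(n) - loss   # the estimate log(N) - L_NCE
true_mi = -0.5 * np.log(1 - rho**2)
print(loss, mi_lower_bound, true_mi)
```

Since the loss is never negative, the estimate can never exceed log(N). That cap is one reason contrastive methods use large batches: with few negatives the bound saturates well below the true mutual information.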

Method	Type	Use case	Cost
MINE	Lower bound	Network diagnostics	O(N^2)
InfoNCE	Lower bound	SimCLR, CLIP	O(N^2)
CLUB	Upper bound	IB minimization	O(N^2)
KSG	Non-parametric	Low-dim cases	O(N log N)

Historical note

MINE (Belghazi et al., 2018) showed that mutual information can be estimated with a neural network. InfoNCE (van den Oord, 2018) became the backbone of contrastive learning. CLIP (OpenAI, 2021) uses InfoNCE to align images and text, a clean product-grade application of information-theoretic ideas in ML.

**Misconception:** InfoNCE loss minimizes I(X; Z) for better invariance.

**Correction:** Minimizing the InfoNCE loss maximizes a lower bound on I(X; Z), since I(X; Z) >= log(N) - L_NCE. The goal is high mutual information between input and representation.

**Why:** Contrastive learning wants representations to carry maximum information about the source data (positive pairs) and minimum about irrelevant data (negatives).

SimCLR learns representations by maximizing I(X; Z). Why use InfoNCE instead of computing I(X; Z) directly?

Takeaways

**ELBO** = E_q[log p(x | z)] - KL[q || p] is a lower bound on log p(x). Maximizing ELBO minimizes a two-part code (MDL).
**Information Bottleneck:** max I(Z; Y) - beta * I(Z; X). Tradeoff: Z must be sufficient for Y and minimal relative to X.
**MINE/InfoNCE:** neural estimators of mutual information. The InfoNCE bound satisfies log(N) - L_NCE <= I(X; Y). Contrastive learning = maximizing this bound.
**VAE = IB:** under appropriate assumptions VAE solves the IB problem. ELBO is the learning objective; it is an information principle.

Questions for reflection

VAE and contrastive learning are different ways to maximize I(X; Z). What is the principled difference in their strategies?
If beta -> infinity in beta-VAE, what happens to reconstruction? Why does that correspond to maximum 'compression' in Information Bottleneck?
CLIP trains on billions of image-text pairs. From an information-theoretic point of view, what is stored in CLIP weights after training?
