Information Theory

Information theory in machine learning

OpenAI's CLIP aligns images and text with the InfoNCE loss, which maximizes a lower bound on the mutual information between the two modalities. GPT-4, VAEs, and diffusion models also rest on information-theoretic principles.

  • **beta-VAE** (DeepMind), a VAE with a reweighted KL term, applies an IB-style tradeoff to learn disentangled representations, for example separating 'shape' from 'color' in latent space.
  • **CLIP** (OpenAI) is trained to maximize a lower bound on I(image; text) through InfoNCE. That shared information is what lets CLIP match images against free-form captions.
  • **Neural compression** (Ballé et al., Google, 2020) explicitly optimizes the rate-distortion (R-D) tradeoff with a VAE-style autoencoder plus entropy coding. It can outperform JPEG at low bitrates.

Prerequisites

  • KL-Divergence and Cross-Entropy

Variational inference and ELBO

Variational inference addresses a hard problem: how to approximate the intractable posterior p(z | x). The idea is to pick a 'simple' family of distributions q(z | x) and find the member closest to p(z | x) in KL divergence. Because log p(x) does not depend on q, minimizing KL[q(z | x) || p(z | x)] is equivalent to maximizing the Evidence Lower Bound (ELBO), and that is where information theory enters: ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)].

**ELBO:** log p(x) = ELBO + KL[q(z | x) || p(z | x)] >= ELBO. ELBO = E_q[log p(x | z)] - KL[q(z | x) || p(z)]. The first term is reconstruction quality (data likelihood). The second is regularization (proximity to the prior). Equality holds when q = p(z | x): ELBO = log p(x).

| ELBO component | Information-theoretic meaning | What it optimizes |
|---|---|---|
| E_q[log p(x \| z)] | Estimate (lower bound) of -H(X \| Z), the conditional term of I(X; Z) | Reconstruction quality |
| -KL[q(z \| x) \|\| p(z)] | Regularization, proximity to the prior | Latent space structure |
| ELBO overall | Lower bound on log p(x) | Data likelihood |
| KL[q(z \| x) \|\| p(z \| x)] | Approximation tightness | How close q is to the true posterior |
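
The identity log p(x) = ELBO + KL[q(z | x) || p(z | x)] can be checked numerically on a model small enough to enumerate. The sketch below assumes a toy model with a binary latent and three possible observations; the probability tables and the names `prior`, `likelihood`, and `elbo` are illustrative and not part of the lesson.

```python
import numpy as np

# Toy model (all numbers are illustrative assumptions):
# latent z in {0, 1}, observation x in {0, 1, 2}.
prior = np.array([0.6, 0.4])                 # p(z)
likelihood = np.array([[0.7, 0.2, 0.1],      # p(x | z=0)
                       [0.1, 0.3, 0.6]])     # p(x | z=1)

def kl(q, p):
    """KL[q || p] in nats for discrete distributions with full support."""
    return float(np.sum(q * (np.log(q) - np.log(p))))

def elbo(q, x):
    """E_q[log p(x | z)] - KL[q(z | x) || p(z)] for one observed x."""
    reconstruction = float(np.sum(q * np.log(likelihood[:, x])))
    return reconstruction - kl(q, prior)

x = 2
joint = prior * likelihood[:, x]             # p(z, x) for the observed x
evidence = joint.sum()                       # p(x)
posterior = joint / evidence                 # exact p(z | x)

q = np.array([0.5, 0.5])                     # a deliberately poor approximation
gap = kl(q, posterior)                       # KL[q(z | x) || p(z | x)]

print(np.log(evidence), elbo(q, x) + gap)    # the two values agree
print(elbo(posterior, x))                    # equals log p(x): the bound is tight
```

With the poor `q` the ELBO sits strictly below log p(x), and the gap is exactly the KL divergence to the true posterior. Setting q to the exact posterior closes the gap.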

Historical note

Variational inference was a statistical tool long before deep learning. It became mainstream in DL with the VAE (Kingma and Welling, 2013), the first scalable generative method trained explicitly on ELBO. Since then ELBO has been the standard tool for generative models and Bayesian deep learning.

Common misconception: a VAE minimizes reconstruction loss the same way as a plain autoencoder.

In fact: a VAE maximizes ELBO = reconstruction - KL[q(z | x) || p(z)]. The KL regularization fundamentally changes the objective: the goal is not just exact reconstruction but also a structured latent space.

Without the KL term the VAE degenerates into an essentially deterministic autoencoder. The KL term ensures that z sampled from the prior still decodes to meaningful outputs.
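
To make the two loss terms concrete, here is a minimal sketch of the per-example training loss, assuming a diagonal Gaussian encoder and a standard normal prior (the usual VAE setup, though the lesson does not fix one). The function names, the `beta` parameter, and the made-up reconstruction values are assumptions for illustration.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL[N(mu, diag(exp(log_var))) || N(0, I)] in nats, summed over latent dims."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def negative_elbo(recon_nll, mu, log_var, beta=1.0):
    """Per-example loss: reconstruction NLL + beta * KL (beta = 1 is the plain VAE)."""
    return recon_nll + beta * gaussian_kl_to_standard_normal(mu, log_var)

# Two examples with a 2-dimensional latent (illustrative encoder outputs).
mu = np.array([[0.0, 0.0], [1.0, -2.0]])
log_var = np.array([[0.0, 0.0], [-1.0, 0.5]])
recon_nll = np.array([30.0, 30.0])           # made-up reconstruction losses in nats

kl = gaussian_kl_to_standard_normal(mu, log_var)
print(kl)                                     # first entry is 0: q equals the prior
print(negative_elbo(recon_nll, mu, log_var))             # VAE loss
print(negative_elbo(recon_nll, mu, log_var, beta=0.0))   # KL dropped: plain reconstruction
```

With `beta=0.0` the loss reduces to the reconstruction term alone, so nothing ties the encoder's output distribution to the prior.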

The KL term in ELBO, KL[q(z | x) || p(z)], is always >= 0. What happens to training if you drop it (beta = 0 in beta-VAE)?

ELBO as coding

ELBO has a clean Minimum Description Length (MDL) interpretation: -log p(x) is the ideal code length of x, and the negative ELBO is an upper bound on it. The reconstruction term, E_q[-log p(x | z)], is the code length of x given z. The KL term, KL[q(z | x) || p(z)], is the cost of encoding z relative to the prior. Together: ELBO = -(description of z) - (description of x given z). Maximizing ELBO is minimizing the total code length.

**Coding interpretation of ELBO:** -log p(x) <= KL[q(z | x) || p(z)] + E_q[-log p(x | z)]. The right-hand side is the cost of coding z relative to the prior plus the code length of x given z. A two-part code: first z, then x. MDL principle: a good model is a good compressor.
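
The coding bound follows from the ELBO identity by negating both sides, which turns the lower bound on log p(x) into an upper bound on the code length -log p(x). Written out in LaTeX, with the same symbols as above:

```latex
\begin{align*}
\log p(x) &= \mathrm{ELBO} + \mathrm{KL}\left[q(z \mid x)\,\|\,p(z \mid x)\right] \ge \mathrm{ELBO}, \\
-\log p(x) &\le -\mathrm{ELBO} \\
&= \underbrace{\mathrm{KL}\left[q(z \mid x)\,\|\,p(z)\right]}_{\text{cost of } z}
 + \underbrace{\mathbb{E}_{q}\left[-\log p(x \mid z)\right]}_{\text{cost of } x \text{ given } z}.
\end{align*}
```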

| Method | Information principle | Optimization target |
|---|---|---|
| VAE | ELBO = MDL two-part code | Lower bound on log p(x) |
| beta-VAE | beta * KL + reconstruction loss | Disentanglement |
| VQ-VAE | Discrete codebook (vector quantization) | Discrete representations |
| InfoVAE | ELBO + mutual information term | z-structure quality |

Historical note

The MDL and MML principles behind this link go back to Rissanen (1978) and Wallace and Boulton (1968). Their neural-network application was crystallized by Hinton and colleagues in the 'Wake-Sleep' algorithm (1995) and later in the VAE (2013). Understanding ELBO as compression opens the door to neural compression.

Common misconception: a small KL in a VAE is a sign of a good model.

In fact: a small KL can mean posterior collapse (q ~ prior), which destroys reconstruction. The optimum is a balance between KL and reconstruction.

Minimizing only the KL is trivial: set q = prior. But then z carries no information about x and reconstruction is impossible.

Compare a VAE with KL = 50 nats and reconstruction loss = 30 nats against a model with KL = 10 nats and reconstruction loss = 100 nats. Which has the better ELBO?
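
The comparison is plain arithmetic on the two-part code length: the negative ELBO is the KL term plus the reconstruction loss, and the smaller total wins. A small sketch using the numbers from the question (the helper names are hypothetical):

```python
import math

def negative_elbo(kl_nats, recon_nats):
    """Total two-part code length in nats: KL term plus reconstruction loss."""
    return kl_nats + recon_nats

def nats_to_bits(nats):
    """Convert a code length from nats to bits."""
    return nats / math.log(2)

model_a = negative_elbo(kl_nats=50.0, recon_nats=30.0)    # 80 nats
model_b = negative_elbo(kl_nats=10.0, recon_nats=100.0)   # 110 nats

print(model_a, model_b)
print(nats_to_bits(model_a), nats_to_bits(model_b))
```

The first model has the larger KL but the shorter total description, so its ELBO (-80 nats) is better than the second model's (-110 nats). A small KL on its own says nothing about model quality.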

Information Bottleneck

Information Bottleneck (Tishby, Pereira, and Bialek, 2000) is a learning principle: find a representation Z of the input X that is maximally informative about the target Y at minimum complexity (minimum mutual information with X). It is a tradeoff: I(Z; Y) as large as possible, I(Z; X) as small as possible. The Lagrangian is max I(Z; Y) - beta * I(Z; X). Connection to VAE: under appropriate assumptions (for instance, taking the target Y to be the input X itself and replacing both terms with variational bounds) a VAE realizes IB.

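**Information Bottleneck:** maximize L_IB = I(Z; Y) - beta * I(Z; X) over encoders p(z | x). As beta -> 0 compression is not penalized, so Z can keep all information about X. As beta grows, Z must discard more of X, and as beta -> infinity it collapses to a constant. The ideal Z is a minimal sufficient statistic of X for Y: it keeps I(Z; Y) = I(X; Y) with the smallest possible I(Z; X). Markov chain: Y - X - Z (Z is defined only through X). Data-processing inequality: I(Z; Y) <= I(X; Y). Z cannot know Y better than X does.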

| beta | Preferred Z | DL example |
|---|---|---|
| beta -> 0 | Z = X (no compression) | Autoencoder without regularization |
| beta ~ 1 | Balance of usefulness and compression | VAE, standard training |
| beta >> 1 | Z keeps only the most Y-relevant information | Strong IB, feature selection |
| beta -> infinity | Z = const (maximum compression) | Input is ignored |
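
For discrete variables both IB terms can be computed exactly, which makes the tradeoff and the bound I(Z; Y) <= I(X; Y) easy to inspect. The sketch below assumes a hand-picked joint table p(x, y) and three deterministic encoders; all tables and names are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """I(A; B) in nats from a joint probability table p(a, b)."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / (pa @ pb)[mask])))

# Illustrative joint p(x, y): X has 4 values, Y has 2.
# Rows 0-1 and rows 2-3 are identical, so Y depends on X only through the pair.
p_xy = np.array([[0.20, 0.05],
                 [0.20, 0.05],
                 [0.05, 0.20],
                 [0.05, 0.20]])
p_x = p_xy.sum(axis=1)

# Deterministic encoders p(z | x); rows index x, columns index z.
encoders = {
    "identity": np.eye(4),
    "merge pairs": np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float),
    "constant": np.ones((4, 1)),
}

i_xy = mutual_information(p_xy)
results = {}
for name, p_z_given_x in encoders.items():
    p_xz = p_x[:, None] * p_z_given_x        # p(x, z)
    p_zy = p_z_given_x.T @ p_xy              # p(z, y), using the chain Y - X - Z
    results[name] = (mutual_information(p_xz), mutual_information(p_zy))
    i_zx, i_zy = results[name]
    print(f"{name}: I(Z;X)={i_zx:.3f}  I(Z;Y)={i_zy:.3f}  I(X;Y)={i_xy:.3f}")
```

In this toy case "merge pairs" keeps all of I(X; Y) while halving I(Z; X) relative to the identity encoder, which is the behavior of a minimal sufficient statistic. The constant encoder reaches maximum compression and carries no information about Y.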

Historical note

Naftali Tishby, with Pereira and Bialek, proposed IB in 2000. In 2017 Shwartz-Ziv and Tishby put forward the controversial 'information plane' hypothesis, that neural networks pass through a fitting phase followed by a compression phase. The hypothesis sparked debate: Saxe et al. (2018) argued that the compression effect is largely an artifact of saturating activation functions.

Common misconception: Information Bottleneck proves that neural networks compress information about X during training.

In fact: the compression-phase hypothesis of Shwartz-Ziv and Tishby is debated, and whether compression appears depends on the activation function and on how I(Z; X) is measured.

Saxe et al. (2018) reported that networks with ReLU activations show no compression phase in their experiments. The effect is specific to saturating activations like tanh and sigmoid.

In Information Bottleneck, what does the Markov chain Y - X - Z mean?

Mutual information estimation

Mutual information I(X; Y) is hard to compute in continuous high-dimensional spaces. Neural estimators address this: MINE (Mutual Information Neural Estimation), CLUB (Contrastive Log-ratio Upper Bound), and InfoNCE are all lower or upper bounds on I(X; Y) parameterized by neural networks. The key application is contrastive learning (SimCLR, CLIP), which maximizes a lower bound on I(X; Z).

**MINE:** I(X; Y) >= E_{p(x, y)}[T(x, y)] - log E_{p(x) p(y)}[e^{T(x, y)}], where T is a neural network (the critic). Maximizing over T tightens the bound toward I(X; Y). **InfoNCE:** I(X; Y) >= log(N) - L_NCE, where L_NCE is the contrastive loss and N is the number of samples in the batch, so the bound can never exceed log(N). Used in SimCLR and CLIP for representation learning.
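
The MINE bound can be evaluated exactly on a small discrete distribution, which shows why maximizing over T recovers the mutual information. The sketch below replaces the neural critic with a lookup table T(x, y); that substitution, the joint table, and the function name `dv_bound` are assumptions for illustration, and MINE itself trains a network by gradient ascent on sampled batches.

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])                 # illustrative joint p(x, y)
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
product = p_x @ p_y                           # p(x) p(y)

def dv_bound(critic):
    """E_{p(x,y)}[T] - log E_{p(x)p(y)}[exp(T)] for a tabular critic T(x, y)."""
    return float(np.sum(p_xy * critic) - np.log(np.sum(product * np.exp(critic))))

true_mi = float(np.sum(p_xy * np.log(p_xy / product)))

optimal_critic = np.log(p_xy / product)       # log density ratio
crude_critic = np.array([[1.0, 0.0],
                         [0.0, 1.0]])         # an arbitrary guess
zero_critic = np.zeros((2, 2))                # uninformative critic

print(true_mi)
print(dv_bound(optimal_critic), dv_bound(crude_critic), dv_bound(zero_critic))
```

The log density ratio attains the bound with equality, the crude critic lands below the true value, and the zero critic gives a bound of 0. Any critic yields a valid lower bound, which is what makes the quantity safe to maximize.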

| Method | Type | Use case | Cost |
|---|---|---|---|
| MINE | Lower bound | Network diagnostics | O(N^2) |
| InfoNCE | Lower bound | SimCLR, CLIP | O(N^2) |
| CLUB | Upper bound | IB minimization | O(N^2) |
| KSG | Non-parametric | Low-dim cases | O(N log N) |

Historical note

MINE (Belghazi et al., 2018) showed that mutual information can be estimated with a neural network. InfoNCE (van den Oord et al., 2018) became the backbone of contrastive learning. CLIP (OpenAI, 2021) uses InfoNCE to align images and text, a clean production-scale application of information-theoretic ideas in ML.

Common misconception: the InfoNCE loss minimizes I(X; Z) for better invariance.

In fact: minimizing the InfoNCE loss L_NCE raises the lower bound log(N) - L_NCE on I(X; Z), so it pushes mutual information up, not down. The goal is maximal mutual information between input and representation.

Contrastive learning wants a representation to carry maximum information about its own source (positive pairs) and as little as possible about unrelated samples (negatives).
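
A minimal NumPy sketch of the InfoNCE loss makes the direction of optimization visible. It assumes cosine-similarity scores with a temperature and a one-directional loss (each x must pick out its own y within the batch); CLIP and SimCLR use their own encoders and symmetric variants, and the function name, temperature, and synthetic embeddings here are illustrative.

```python
import numpy as np

def info_nce_loss(x_emb, y_emb, temperature=0.1):
    """Mean cross-entropy of picking the matching y for each x within the batch."""
    x = x_emb / np.linalg.norm(x_emb, axis=1, keepdims=True)
    y = y_emb / np.linalg.norm(y_emb, axis=1, keepdims=True)
    logits = x @ y.T / temperature                        # (N, N) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

rng = np.random.default_rng(0)
n, d = 64, 16
x = rng.normal(size=(n, d))
pairs = {
    "aligned": x + 0.1 * rng.normal(size=(n, d)),   # positives close to their partners
    "unrelated": rng.normal(size=(n, d)),           # no shared information
}

losses = {}
for name, y in pairs.items():
    losses[name] = info_nce_loss(x, y)
    print(name, losses[name], np.log(n) - losses[name])   # loss, then bound log(N) - L_NCE
```

Aligned pairs give a low loss and therefore a high bound, while unrelated pairs give a high loss and a bound near or below zero. Because the loss is non-negative, the bound is capped at log(N), which is one reason contrastive methods favor large batches.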

SimCLR learns representations by maximizing I(X; Z). Why use InfoNCE instead of computing I(X; Z) directly?

Takeaways

  • **ELBO** = E_q[log p(x | z)] - KL[q(z | x) || p(z)] is a lower bound on log p(x). Maximizing ELBO minimizes a two-part code length (MDL).
  • **Information Bottleneck:** max I(Z; Y) - beta * I(Z; X). Tradeoff: Z must be sufficient for Y and minimal relative to X.
  • **MINE/InfoNCE:** neural estimators of mutual information. The InfoNCE bound satisfies log(N) - L_NCE <= I(X; Y). Contrastive learning maximizes this bound.
  • **VAE = IB:** under appropriate assumptions VAE solves the IB problem. ELBO is both the training objective and an information-theoretic principle.

Related topics

Information theory in ML links theoretical concepts to practical training methods.

  • Rate-Distortion Theory — ELBO is the R-D tradeoff for coding data. Neural codecs are explicit R-D optimization.
  • Information theory in deep learning — PAC-Bayes, generalization, information plane: the next layer up.
  • Data compression: JPEG, H.265, LLM — Neural compression is VAE plus entropy coding for media.

Questions for reflection

  • VAE and contrastive learning are different ways to maximize I(X; Z). What is the principled difference in their strategies?
  • If beta -> infinity in beta-VAE, what happens to reconstruction? Why does that correspond to maximum 'compression' in Information Bottleneck?
  • CLIP trains on billions of image-text pairs. From an information-theoretic point of view, what is stored in CLIP weights after training?

Related lessons

  • ml-09-gradient-descent
  • stat-27-graphical-models