Information Geometry

IG in Generative Models

Stable Diffusion, FLUX.1, VAEs: behind each of these generative frameworks stands one idea, namely how to move one distribution onto another along a short path in probability space. Score functions, the ELBO, and flow matching are different formulations of closely related information-geometric problems.

DDPM score matching = learning the score field (gradient of the log-density) of the noised data distributions
FLUX.1 (Black Forest Labs) - flow matching speeds up generation 3-5x via OT geodesics
beta-VAE: increasing beta = stronger pull of the approximate posterior toward the prior

ELBO as an Information-Geometric Projection

The **Evidence Lower Bound** is the central object in VAEs. The formula: $\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \| p(z))$. The first term measures reconstruction quality; the second regularizes the latent space geometry.

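**Information-geometric interpretation:** for fixed $\theta$, maximizing the ELBO over $\phi$ minimizes $D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$. In Amari's convention this is the e-projection (reverse-KL, or I-projection) of the exact posterior onto the variational family. The KL term $D_{KL}(q_\phi(z|x) \| p(z))$ is not a squared distance, but KL agrees with the Fisher metric locally: $D_{KL}(p_\theta \| p_{\theta + d\theta}) \approx \frac{1}{2} d\theta^T \mathcal{F}(\theta) d\theta$. The symmetrized KL $D_{KL}(q \| p) + D_{KL}(p \| q)$ (Jeffreys divergence) therefore matches the squared Fisher-Rao geodesic distance $d^2(q, p)$ only to second order; for distant distributions the two differ.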
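
How KL relates to the Fisher-Rao distance can be checked numerically for univariate Gaussians, where both have closed forms. The sketch below is illustrative only: the function names are made up for this example, and the Fisher-Rao formula assumes the $(\mu, \sigma)$ parametrization with metric $ds^2 = (d\mu^2 + 2 d\sigma^2)/\sigma^2$.

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    """KL( N(mu1, s1^2) || N(mu2, s2^2) ) in closed form (s = std dev)."""
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

def jeffreys(mu1, s1, mu2, s2):
    """Symmetrized KL (Jeffreys divergence)."""
    return kl_gauss(mu1, s1, mu2, s2) + kl_gauss(mu2, s2, mu1, s1)

def fisher_rao_sq(mu1, s1, mu2, s2):
    """Squared Fisher-Rao distance between univariate Gaussians.

    The Fisher metric (dmu^2 + 2 dsigma^2) / sigma^2 is a rescaled
    hyperbolic half-plane metric, which yields this closed form.
    """
    arg = 1 + ((mu1 - mu2)**2 + 2 * (s1 - s2)**2) / (4 * s1 * s2)
    return 2 * math.acosh(arg)**2

near = (0.0, 1.0, 0.01, 1.01)   # two almost identical Gaussians
far = (0.0, 1.0, 5.0, 3.0)      # two clearly different Gaussians

ratio_near = jeffreys(*near) / fisher_rao_sq(*near)
ratio_far = jeffreys(*far) / fisher_rao_sq(*far)
print(f"Jeffreys / Fisher-Rao^2, nearby:  {ratio_near:.4f}")
print(f"Jeffreys / Fisher-Rao^2, distant: {ratio_far:.4f}")
```

For the nearby pair the ratio is close to 1; for the distant pair it is clearly larger, which is the sense in which KL is only a local proxy for geodesic distance.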

The **amortized inference gap** (often just called the inference gap) is the difference between the log-likelihood and the ELBO. It equals the KL divergence from the amortized posterior $q_\phi(z|x)$ to the true posterior $p_\theta(z|x)$: $\log p_\theta(x) - \mathcal{L} = D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) \geq 0$. Geometrically: $q_\phi$ fails to reach the exact posterior, either because the variational family does not contain it or because the encoder lacks the capacity to find the best member of the family for every $x$.
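
The gap identity can be verified exactly in a toy model where everything is Gaussian. The sketch below assumes a hypothetical one-dimensional model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, \sigma^2)$, chosen only because the marginal likelihood and the posterior are available in closed form; none of the names come from a particular library.

```python
import math

def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mean)**2 / (2 * var)

def kl_gauss_var(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ), parametrized by variances."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2)**2) / v2 - 1.0)

def elbo(x, m, v, noise_var):
    """ELBO for q(z|x) = N(m, v) when z ~ N(0, 1) and x|z ~ N(z, noise_var)."""
    # E_q[log p(x|z)] in closed form, using E_q[(x - z)^2] = (x - m)^2 + v.
    recon = (-0.5 * math.log(2 * math.pi * noise_var)
             - ((x - m)**2 + v) / (2 * noise_var))
    return recon - kl_gauss_var(m, v, 0.0, 1.0)

x, noise_var = 1.3, 0.5
log_px = log_normal_pdf(x, 0.0, 1.0 + noise_var)   # exact log-likelihood
post_mean = x / (1.0 + noise_var)                   # exact posterior mean
post_var = noise_var / (1.0 + noise_var)            # exact posterior variance

m, v = 0.2, 0.9                                     # a deliberately poor q
gap = log_px - elbo(x, m, v, noise_var)
kl_to_posterior = kl_gauss_var(m, v, post_mean, post_var)
gap_at_optimum = log_px - elbo(x, post_mean, post_var, noise_var)

print(f"log p(x) - ELBO   = {gap:.6f}")
print(f"KL(q || posterior) = {kl_to_posterior:.6f}")
print(f"gap when q is the exact posterior = {gap_at_optimum:.2e}")
```

The first two numbers coincide, and the gap closes when $q$ equals the exact posterior, which is also why the ELBO is a lower bound.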

Stable Diffusion (v1, 512x512 images) runs diffusion in a VAE latent space of size 64x64x4 instead of the 512x512x3 pixel space. The quality of the VAE projection directly caps generation quality: detail lost by a poor encoder-decoder pair cannot be recovered by the diffusion model.

Why is ELBO a lower bound on the log-likelihood?

Score Functions as the Geometry of Distributions

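The **score function** $s_\theta(x) = \nabla_x \log p_\theta(x)$ is a vector field on data space pointing toward increasing density. Its parameter-space counterpart, the Fisher score $\nabla_\theta \log p_\theta(x)$, is what information geometry treats as a tangent vector to the distribution manifold, and its covariance defines the Fisher metric. For the data-space score the normalizing constant vanishes under differentiation, so no partition function is needed.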

**Fisher information via score:** $\mathcal{F}(\theta) = \mathbb{E}_{p_\theta}[\nabla_\theta \log p_\theta \, (\nabla_\theta \log p_\theta)^T]$. Score matching (Hyvärinen, 2005) trains the data-space score directly by minimizing the Fisher divergence $\mathbb{E}_{p_{data}}[\|\nabla_x \log p_\theta - \nabla_x \log p_{data}\|^2]$. Integration by parts removes the unknown $\nabla_x \log p_{data}$ from the objective, and normalizing constants never need to be computed.
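
As a concrete instance of the Fisher information formula, the sketch below evaluates $\mathbb{E}[\nabla_\theta \log p_\theta \, (\nabla_\theta \log p_\theta)^T]$ for a univariate Gaussian with $\theta = (\mu, \sigma)$, where the known answer is $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$. The expectation is taken by a simple Riemann sum; the helper names and grid settings are assumptions made for this illustration.

```python
import math

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma)**2) / (sigma * math.sqrt(2 * math.pi))

def score_theta(x, mu, sigma):
    """Gradient of log N(x; mu, sigma^2) with respect to theta = (mu, sigma)."""
    return ((x - mu) / sigma**2, ((x - mu)**2 - sigma**2) / sigma**3)

def fisher_information(mu, sigma, n=4001, width=10.0):
    """E[score score^T], with the expectation taken by a Riemann sum."""
    lo = mu - width * sigma
    dx = 2 * width * sigma / (n - 1)
    F = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(n):
        x = lo + k * dx
        weight = gaussian_pdf(x, mu, sigma) * dx
        g = score_theta(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                F[i][j] += weight * g[i] * g[j]
    return F

sigma = 2.0
F = fisher_information(mu=1.0, sigma=sigma)
print("numerical F:", [[round(v, 6) for v in row] for row in F])
print("closed form: diag(", 1 / sigma**2, ",", 2 / sigma**2, ")")
```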

**Stein-type discrepancies** compare distributions through their scores rather than their densities. The simplest is the Fisher divergence (relative Fisher information): $S(q, p) = \mathbb{E}_q[\|\nabla_x \log q - \nabla_x \log p\|^2]$. For smooth, everywhere-positive densities it equals zero if and only if $q = p$, so it is a valid divergence, although it is asymmetric and therefore not a distance.

DDPM trains $\varepsilon_\theta(x_t, t) \approx -\sqrt{1-\bar\alpha_t} \nabla_{x_t} \log p_t(x_t)$. This is the score function of the noised distribution, with inverted sign and rescaled by the noise standard deviation $\sqrt{1-\bar\alpha_t}$. Stable Diffusion 1.x uses this noise-prediction parametrization, and many other diffusion models use it or an equivalent reparametrization.

What is a score function in information-geometric terms?

Diffusion Models Through the IG Lens

**Forward diffusion process:** $q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar\alpha_t} x_0, (1-\bar\alpha_t)I)$. For fixed $x_0$ this traces a curve in the family of isotropic Gaussians, starting at a point mass on $x_0$ and approaching $\mathcal{N}(0, I)$ as $\bar\alpha_t \to 0$. The curve is not in general a Fisher-Rao geodesic, and the noising schedule is a design choice (linear, cosine, and others) that sets the speed along it. The marginal $q(x_t)$ of real data is not Gaussian at intermediate times.
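
The closed-form marginal $q(x_t \mid x_0)$ follows from composing single noising steps. The sketch below checks this by propagating the mean and variance step by step and comparing with $\sqrt{\bar\alpha_t} x_0$ and $1 - \bar\alpha_t$. The linear schedule from $10^{-4}$ to $0.02$ over 1000 steps is an assumed example (values commonly quoted for DDPM), not something fixed by the text.

```python
import math

def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Assumed example schedule: beta_t grows linearly with t."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s <= t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def marginal_by_stepping(x0, betas):
    """Mean and variance of q(x_t | x_0) after applying every single step.

    One step is x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise.
    """
    mean, var = x0, 0.0
    for b in betas:
        mean = math.sqrt(1.0 - b) * mean
        var = (1.0 - b) * var + b
    return mean, var

betas = linear_betas()
abar = alpha_bars(betas)
x0 = 2.0
mean_T, var_T = marginal_by_stepping(x0, betas)

print(f"stepwise:    mean={mean_T:.6f}, var={var_T:.6f}")
print(f"closed form: mean={math.sqrt(abar[-1]) * x0:.6f}, var={1 - abar[-1]:.6f}")
```

At the final step the mean is close to 0 and the variance close to 1, so the endpoint is nearly $\mathcal{N}(0, I)$ regardless of $x_0$.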

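**Denoising score matching loss:** $\mathcal{L}_{DSM} = \mathbb{E}_{t, x_0, \varepsilon}[\|\varepsilon_\theta(x_t, t) - \varepsilon\|^2]$. Up to a constant independent of $\theta$ and a per-timestep weighting, this is the Fisher divergence between the noised data distribution and the model, that is, a squared error between the true and predicted scores. The minimum is attained at $\varepsilon_\theta(x_t, t) = \mathbb{E}[\varepsilon \mid x_t] = -\sqrt{1-\bar\alpha_t} \nabla_{x_t} \log p_t(x_t)$.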
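
The link between noise prediction and the score can be seen in a case simple enough to solve by hand. The sketch below assumes one-dimensional Gaussian data and a single fixed noise level, fits a linear noise predictor $a x_t + b$ by least squares (which minimizes the denoising loss within that model class), and compares it with $-\sqrt{1-\bar\alpha_t}$ times the closed-form score. All names and parameter values are illustrative.

```python
import math
import random

def fit_linear_eps_model(abar, data_mean, data_std, n=100_000, seed=0):
    """Least-squares fit of eps ~ a * x_t + b on simulated (x_t, eps) pairs."""
    rng = random.Random(seed)
    sx = sy = sxx = sxy = 0.0
    for _ in range(n):
        x0 = rng.gauss(data_mean, data_std)
        eps = rng.gauss(0.0, 1.0)
        xt = math.sqrt(abar) * x0 + math.sqrt(1.0 - abar) * eps
        sx += xt
        sy += eps
        sxx += xt * xt
        sxy += xt * eps
    var_x = sxx / n - (sx / n)**2
    cov_xy = sxy / n - (sx / n) * (sy / n)
    a = cov_xy / var_x
    b = sy / n - a * sx / n
    return a, b

def eps_from_score(xt, abar, data_mean, data_std):
    """-sqrt(1 - abar) * d/dx log p_t(x_t), closed form for Gaussian data."""
    var_t = abar * data_std**2 + (1.0 - abar)
    score = -(xt - math.sqrt(abar) * data_mean) / var_t
    return -math.sqrt(1.0 - abar) * score

abar, data_mean, data_std = 0.5, 1.0, 2.0
a, b = fit_linear_eps_model(abar, data_mean, data_std)
for xt in (-1.0, 0.0, 2.0):
    learned = a * xt + b
    target = eps_from_score(xt, abar, data_mean, data_std)
    print(f"x_t={xt:+.1f}  learned eps={learned:+.4f}  -sqrt(1-abar)*score={target:+.4f}")
```

The fitted predictor matches the rescaled score up to Monte Carlo error. For Gaussian data the optimal predictor happens to be linear; for real data a neural network plays that role.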

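**Flow matching** (Lipman et al., 2022) trains a continuous normalizing flow by regression onto a target vector field, without simulating trajectories during training. A vector field $v_t(x)$ generates the path $x_0 \to x_1$ via an ODE: $dx/dt = v_t(x)$. With optimal-transport conditional paths the interpolant is the straight line $x_t = (1-t) x_0 + t x_1$ with target velocity $x_1 - x_0$. When the pairs $(x_0, x_1)$ are coupled by an optimal transport plan (OT-CFM), the marginal path is a Wasserstein geodesic, the shortest path in the space of measures; with independently sampled pairs the conditional paths are still straight, but the marginal path is generally not a geodesic.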
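
Why straight paths help at sampling time can be shown in a case where the optimal transport map is known. The sketch below assumes a one-dimensional transport from $\mathcal{N}(0, 1)$ to $\mathcal{N}(m, s^2)$, where the OT map is $x_1 = m + s x_0$; the velocity field is written down analytically rather than learned, and the function names are hypothetical.

```python
def ot_velocity(x, t, m, s):
    """Velocity field moving N(0, 1) to N(m, s^2) along straight OT paths.

    The interpolation is x_t = t * m + ((1 - t) + t * s) * x_0 with s > 0.
    """
    scale = (1.0 - t) + t * s
    x0 = (x - t * m) / scale       # invert the interpolation to recover x_0
    return m + (s - 1.0) * x0      # d x_t / dt, constant along each path

def euler_integrate(x0, m, s, n_steps):
    """Integrate dx/dt = v_t(x) from t = 0 to t = 1 with explicit Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * ot_velocity(x, k * dt, m, s)
    return x

m, s = 3.0, 0.5
starts = [-2.0, -0.5, 0.0, 1.0, 2.5]
one_step = [euler_integrate(x0, m, s, 1) for x0 in starts]
many_steps = [euler_integrate(x0, m, s, 50) for x0 in starts]
exact = [m + s * x0 for x0 in starts]

for x0, a, b, c in zip(starts, one_step, many_steps, exact):
    print(f"x0={x0:+.1f}  1 step: {a:.6f}  50 steps: {b:.6f}  OT map: {c:.6f}")
```

Because the velocity is constant along each trajectory, a single Euler step already lands on the transport map. Learned fields on real data are only approximately straight, so a few steps are still needed in practice.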

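**Schrödinger bridge** - the diffusion process closest in KL divergence to a reference process, subject to prescribed marginals. Formally: $\min_{P} D_{KL}(P \| W)$, where $W$ is the reference measure (typically Wiener measure) and $P$ ranges over path measures with the prescribed marginals at $t = 0$ and $t = 1$. Because the minimization is over the first argument of the KL divergence and the constraint set is linear, this is an e-projection (I-projection) of $W$ onto that set in the space of path measures.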
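
A static, discrete analogue of this problem keeps only the joint law of the two endpoints: find the coupling $P$ with prescribed row and column sums that minimizes $D_{KL}(P \| R)$ for a reference joint $R$. Iterative proportional fitting (Sinkhorn iterations) solves it by alternately rescaling rows and columns. The sketch below is a toy version under that simplification; the grid, kernel, and marginals are made-up values.

```python
import math

def sinkhorn_bridge(mu, nu, ref, n_iter=500):
    """Return P = diag(u) * ref * diag(v) with row sums mu and column sums nu.

    Each rescaling is a KL projection onto one marginal constraint; the
    limit minimizes KL(P || ref) among all couplings of mu and nu.
    """
    n, m = len(mu), len(nu)
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [mu[i] / sum(ref[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [nu[j] / sum(ref[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * ref[i][j] * v[j] for j in range(m)] for i in range(n)]

# Reference joint law: a Gaussian (Brownian-like) kernel on a small 1D grid.
grid = [0.0, 1.0, 2.0, 3.0]
kernel = [[math.exp(-(a - b)**2 / 2.0) for b in grid] for a in grid]
total = sum(map(sum, kernel))
ref = [[k / total for k in row] for row in kernel]

mu = [0.7, 0.1, 0.1, 0.1]   # prescribed start marginal
nu = [0.1, 0.1, 0.1, 0.7]   # prescribed end marginal
P = sinkhorn_bridge(mu, nu, ref)

row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(len(mu))) for j in range(len(nu))]
print("row sums:", [round(r, 6) for r in row_sums])
print("col sums:", [round(c, 6) for c in col_sums])
```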

FLUX.1 (Black Forest Labs, 2024) is a rectified-flow model trained with flow matching in a latent space. Straight-line trajectories need fewer solver steps for comparable generation quality; the speedup over DDPM-style sampling is on the order of 3-5x and depends on the solver and the model.

How does flow matching relate to optimal transport?

Unified View: VAE, GAN, Flow, Diffusion

All four families of generative models can be read as projections or transports in distribution space. **VAE:** approximate maximum likelihood through the ELBO, that is, an m-projection of the data distribution onto the model family, combined with an e-projection of the exact posterior onto the encoder family. **GAN:** adversarial approximation of Wasserstein/JS divergence. **Normalizing flows:** exact maximum likelihood (m-projection) through a chain of diffeomorphisms, with the log-det-Jacobian accounting for the change of volume. **Diffusion/score:** reverse-time transport driven by the learned score field.

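**Generalized Pythagorean theorem:** in a dually flat geometry, $D_{KL}(p \| r) = D_{KL}(p \| q) + D_{KL}(q \| r)$ whenever $q$ is the m-projection of $p$ onto an e-flat submanifold (for example, an exponential family) and $r$ is any point of that submanifold. This is the information-geometric analog of the Pythagorean theorem: the m-geodesic from $p$ to $q$ meets the e-geodesic from $q$ to $r$ at a right angle in the Fisher metric.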
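
A small numerical check makes the theorem concrete. Independent (product) distributions form an exponential family, and the closest product distribution to a joint $p(x, y)$ in the sense of $D_{KL}(p \| \cdot)$ is the product of its marginals. The sketch below uses an arbitrary 2 x 3 table chosen for illustration.

```python
import math

def kl(p, q):
    """KL divergence between two discrete distributions given as flat lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def product(a, b):
    """Joint distribution of independent variables, flattened row by row."""
    return [ai * bj for ai in a for bj in b]

# A correlated joint distribution p(x, y) on a 2 x 3 table (rows index x).
p = [0.30, 0.10, 0.05,
     0.05, 0.15, 0.35]
px = [sum(p[0:3]), sum(p[3:6])]
py = [p[j] + p[3 + j] for j in range(3)]

q = product(px, py)                         # projection of p onto independent laws
r = product([0.6, 0.4], [0.2, 0.5, 0.3])    # any other independent law

lhs = kl(p, r)
rhs = kl(p, q) + kl(q, r)
print(f"KL(p||r)            = {lhs:.10f}")
print(f"KL(p||q) + KL(q||r) = {rhs:.10f}")
```

Here $D_{KL}(p \| q)$ is the mutual information of $x$ and $y$, and the decomposition holds for every independent $r$.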

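**Mode coverage vs mode dropping.** Write $p$ for the data distribution and $q$ for the model. The m-projection (maximum likelihood: VAE, normalizing flows) minimizes $D_{KL}(p \| q)$, which heavily penalizes $q \approx 0$ where $p > 0$, so the model covers all modes. The e-projection minimizes $D_{KL}(q \| p)$, which tolerates zero $q$ mass on some modes, so the model can drop them; GAN objectives are not exactly a reverse KL but behave similarly in this respect. Geometrically: two different projections onto the same submanifold.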
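
The two behaviors are easy to reproduce by fitting a single Gaussian to a bimodal target under each direction of the KL divergence. In the sketch below the target, the integration grid, and the coarse parameter search are all illustrative choices: the forward-KL fit is obtained by moment matching (its exact minimizer within the Gaussian family), and the reverse-KL fit by a grid search.

```python
import math

def log_gauss(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma)**2 - math.log(sigma * math.sqrt(2 * math.pi))

def log_target(x):
    """Bimodal target: equal mixture of N(-3, 0.5^2) and N(3, 0.5^2)."""
    a = log_gauss(x, -3.0, 0.5) + math.log(0.5)
    b = log_gauss(x, 3.0, 0.5) + math.log(0.5)
    hi = max(a, b)
    return hi + math.log(math.exp(a - hi) + math.exp(b - hi))

DX = 0.025
XS = [-10.0 + DX * k for k in range(801)]
LOG_P = [log_target(x) for x in XS]

def reverse_kl(mu, sigma):
    """KL(q || p) for q = N(mu, sigma^2), by a Riemann sum on the grid."""
    total = 0.0
    for x, lp in zip(XS, LOG_P):
        lq = log_gauss(x, mu, sigma)
        total += math.exp(lq) * (lq - lp) * DX
    return total

# Forward KL(p || q) over Gaussians is minimized by matching mean and variance.
mean_fwd = sum(math.exp(lp) * x * DX for x, lp in zip(XS, LOG_P))
var_fwd = sum(math.exp(lp) * (x - mean_fwd)**2 * DX for x, lp in zip(XS, LOG_P))
sigma_fwd = math.sqrt(var_fwd)

# Reverse KL(q || p): coarse grid search over (mu, sigma).
candidates = [(-5.0 + 0.25 * i, 0.25 * j) for i in range(41) for j in range(1, 17)]
mu_rev, sigma_rev = min(candidates, key=lambda c: reverse_kl(*c))

print(f"forward KL fit: mu={mean_fwd:+.3f}, sigma={sigma_fwd:.3f}  (covers both modes)")
print(f"reverse KL fit: mu={mu_rev:+.3f}, sigma={sigma_rev:.3f}  (locks onto one mode)")
```

The forward-KL fit sits between the modes with a large variance, placing mass where the target has almost none, which is the one-dimensional version of blur. The reverse-KL fit reproduces one mode sharply and ignores the other.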

Failure modes through a geometric lens: VAEs blur images (m-projection averages over all modes), GANs produce artifacts and mode collapse (e-projection concentrates on a subset of modes). The choice of divergence = choice of search geometry = qualitatively different failure patterns.

Why do VAEs produce blurry images compared to GANs?

Summary

Maximizing the ELBO over the encoder is an e-projection (reverse KL) of the exact posterior onto the variational family; KL agrees with the Fisher-Rao metric only locally, to second order
The data-space score $\nabla_x \log p(x)$ is a vector field pointing toward higher density, while the parameter-space score defines the Fisher metric; DDPM learns the data-space score through denoising
Forward diffusion traces a schedule-dependent curve of Gaussians toward $\mathcal{N}(0, I)$; OT-CFM uses straight-line trajectories that follow Wasserstein geodesics
VAE, GAN, Flow, and Diffusion are projections onto different manifolds - the choice of divergence determines mode coverage vs mode dropping

Questions for Reflection

What happens to VAE behavior when switching the training objective from $D_{KL}(q \| p)$ to $D_{KL}(p \| q)$? Which failure mode disappears, and which appears?
The Schrödinger bridge minimizes KL from a process to Wiener measure. In what sense does it generalize both ELBO and optimal transport?
Normalizing flows and GANs both push a simple base distribution through a deterministic map. Why do flows, which are trained by exact maximum likelihood, avoid mode dropping while GANs do not?

Related Lessons

ig-08-info-projection - ELBO optimization is a KL projection between the variational family and the exact posterior
ig-11-wasserstein-vs-fisher - Generative models choose between Fisher and Wasserstein geometry
ig-07-natural-gradient - Score functions are covariant gradients in the Fisher metric
ot-11-flow-matching - Flow matching is geodesic transport on the distribution manifold
ig-04-kl-bregman - KL divergence defines all projections in ELBO and diffusion