Information Geometry

Statistical manifolds: distributions as points of geometry

In 1945, Rao noticed that the space of probability distributions is itself a manifold with a natural geometry. The observation waited about 40 years for Amari to develop it into a full theory. Today the same geometry gives a common reading of Adam, K-FAC, TRPO, and the VAE - in short, of much of the tooling of modern deep learning.

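  • Adam - diagonal preconditioning by the empirical Fisher: $g/\sqrt{\mathbb{E}[g^2]}$ resembles a diagonal natural gradient (the exact diagonal form would divide by $\mathbb{E}[g^2]$, without the square root)
  • K-FAC and Shampoo - Kronecker-factored block preconditioners (K-FAC approximates the Fisher directly), aimed at 2-5x faster LLM training
  • TRPO/PPO - step in Fisher-Rao metric: KL constraint on policy change
  • VAE ELBO - minimize $\mathrm{KL}(q_\phi \| p)$ over $q_\phi$: e-projection (reverse-KL projection) of the posterior onto the variational family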
  • Mirror descent on simplex: exponentiated gradient for online learning and bandit algorithms

Prerequisites

  • Partial derivatives and gradient of a multivariate function
  • Expectation: $\mathbb{E}[f(X)] = \int f(x) p(x) dx$
  • Log-likelihood and MLE: $\hat\theta = \arg\max \sum \log p(x_i; \theta)$
  • Functions of Several Variables
  • MLE: Why PyTorch's Cross-Entropy Loss is Fisher's 1922 Formula

Distributions as points: the statistical manifold

The idea: a parametric family $\mathcal{M} = \{p(x; \theta) : \theta \in \Theta \subset \mathbb{R}^d\}$ is not just a collection of functions - it is a **manifold**. Each $\theta$ picks out a point, the distribution $p(\cdot; \theta)$. Parameters are coordinates. All tools of differential geometry apply to this object.

**The tangent vectors** at the point $\theta$ are the components $\partial_i \log p(x; \theta)$ of the **score function**. Each is a random function of $x$ that lives in the tangent space $T_\theta \mathcal{M}$. Intuition: the score points in the direction of steepest log-likelihood ascent. When the loss is the negative log-likelihood, this is, up to sign, exactly what backprop computes during training.
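To make the score concrete, here is a minimal sketch for the univariate Gaussian $N(\mu, \sigma)$. The function names `gaussian_score` and `mean_score` are illustrative, not from any library. It evaluates the score analytically and estimates its expectation by sampling. Under the usual regularity conditions the expected score is zero, which the Monte Carlo average should reproduce up to sampling noise.

```python
import random


def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma) at x: gradient of log p(x; mu, sigma) w.r.t. (mu, sigma)."""
    d_mu = (x - mu) / sigma**2
    d_sigma = ((x - mu) ** 2 - sigma**2) / sigma**3
    return d_mu, d_sigma


def mean_score(mu, sigma, n=200_000, seed=0):
    """Monte Carlo estimate of E[score] under samples drawn from N(mu, sigma)."""
    rng = random.Random(seed)
    total_mu = 0.0
    total_sigma = 0.0
    for _ in range(n):
        x = rng.gauss(mu, sigma)
        d_mu, d_sigma = gaussian_score(x, mu, sigma)
        total_mu += d_mu
        total_sigma += d_sigma
    return total_mu / n, total_sigma / n
```

Calling `mean_score(1.0, 2.0)` should return two values close to zero; the individual scores are far from zero, only their average vanishes.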

**ML insight:** when a neural network is updated via the gradient over $\theta$, it moves through Euclidean parameter space - as if the manifold were flat. It is not flat. A small step in $\theta$ can yield a large shift in the output distribution (and vice versa). Information geometry provides the right metric - one that reflects the difference between distributions, not between coordinates.
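A small sketch of this mismatch, assuming a Bernoulli model parameterized by its success probability (the helper `kl_bernoulli` is a hypothetical name). Two steps of the same Euclidean length 0.009 are compared: one in the middle of the parameter range and one near the boundary.

```python
import math


def kl_bernoulli(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) for p, q strictly between 0 and 1."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


# Same coordinate distance |p - q| = 0.009 in both cases.
kl_middle = kl_bernoulli(0.5, 0.509)     # step taken near p = 0.5
kl_boundary = kl_bernoulli(0.01, 0.001)  # step taken near p = 0

print(f"KL near the middle:   {kl_middle:.6f}")
print(f"KL near the boundary: {kl_boundary:.6f}")
```

The second divergence is far larger than the first even though the coordinate change is identical, which is the sense in which the Euclidean metric on $\theta$ misreports the change in the distribution.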

What is a 'point' on the statistical manifold $\mathcal{M} = \{p(x; \theta)\}$?

Fisher metric: the right distance between distributions

If $\mathcal{M}$ is a manifold, it needs a metric. Rao proposed an inner product on the tangent space via the expected product of score functions: $g_{ij}(\theta) = \mathbb{E}_{p(x;\theta)}[\partial_i \log p \cdot \partial_j \log p]$. In matrix form this is the **Fisher information matrix**: $\mathcal{I}(\theta) = \mathbb{E}[\nabla_\theta \log p \cdot (\nabla_\theta \log p)^\top]$.

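**Three equivalent formulas** (under regularity conditions): $\mathcal{I}(\theta) = \mathbb{E}[\nabla \log p \cdot (\nabla \log p)^\top] = -\mathbb{E}[\nabla^2 \log p] = \mathrm{Cov}(\nabla \log p)$. The last equality holds because the score has zero mean.

**Cramér-Rao bound:** $\mathrm{Cov}(\hat\theta) \succeq \mathcal{I}(\theta)^{-1}$ for any unbiased estimator built from a single observation (with $n$ i.i.d. observations the bound is $(n\,\mathcal{I}(\theta))^{-1}$) - no unbiased estimator has lower variance than the inverse Fisher information.

**Natural gradient:** $\theta \leftarrow \theta - \eta \mathcal{I}^{-1} \nabla L$ with step size $\eta$ - steepest descent measured in the Fisher-Rao metric. The update direction is invariant under reparameterization (exactly so in the limit of small steps).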
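The formulas above can be checked by hand on the simplest case. The sketch below assumes a Bernoulli model with parameter $p$ (function names are illustrative). Expectations are exact sums over the two outcomes, so the three expressions for the Fisher information should agree with $1/(p(1-p))$ up to floating-point error. The second function takes one natural-gradient step on the average negative log-likelihood of data with sample mean `x_bar`.

```python
def bernoulli_fisher_three_ways(p):
    """Fisher information of Bernoulli(p) w.r.t. p, computed three ways.

    Expectations are exact sums over the two outcomes x in {0, 1}.
    """
    outcomes = [(1, p), (0, 1.0 - p)]  # (x, probability of x)

    def score(x):  # d/dp log p(x; p)
        return x / p - (1 - x) / (1.0 - p)

    def hessian(x):  # d^2/dp^2 log p(x; p)
        return -x / p**2 - (1 - x) / (1.0 - p) ** 2

    outer = sum(w * score(x) ** 2 for x, w in outcomes)       # E[score^2]
    neg_hessian = -sum(w * hessian(x) for x, w in outcomes)   # -E[hessian]
    mean = sum(w * score(x) for x, w in outcomes)             # E[score]
    variance = sum(w * (score(x) - mean) ** 2 for x, w in outcomes)
    return outer, neg_hessian, variance


def natural_gradient_step(p, x_bar, lr):
    """One natural-gradient step on the average negative log-likelihood."""
    grad = -(x_bar / p - (1.0 - x_bar) / (1.0 - p))  # dL/dp
    fisher = 1.0 / (p * (1.0 - p))                    # I(p) for Bernoulli
    return p - lr * grad / fisher
```

In this model the preconditioned gradient simplifies to $-(\bar{x} - p)$, so a single step with `lr=1.0` lands on the maximum-likelihood estimate $\bar{x}$ from any starting point in $(0, 1)$. A plain gradient step has no such property, because its size depends on how close $p$ is to the boundary.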

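**The main surprise for an engineer:** the space of Gaussians $\{N(\mu, \sigma)\}$ with the Fisher metric is not Euclidean but **hyperbolic** (the Lobachevsky plane). In $(\mu, \sigma)$ coordinates the metric is $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$, so equal coordinate steps are not equal distances: shifting the mean by 1 takes $N(0,1)$ to $N(1,1)$ over a Fisher-Rao distance of about 0.98, while the same shift from $N(0,10)$ to $N(1,10)$ covers only about 0.10. As $\sigma \to 0$ the metric blows up like $1/\sigma^2$, so a fixed Euclidean step changes the distribution more and more. That is why SGD over $\sigma$ becomes unstable near the boundary $\sigma \to 0$.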
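For univariate Gaussians the Fisher-Rao distance has a closed form, because the metric $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$ is twice the Poincaré half-plane metric in the coordinates $(\mu/\sqrt{2}, \sigma)$. The sketch below implements that closed form; the function name is illustrative and the second argument of each pair is the standard deviation, not the variance.

```python
import math


def fisher_rao_gaussian(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1) and N(mu2, sigma2).

    Uses the half-plane distance formula in coordinates (mu / sqrt(2), sigma),
    scaled by sqrt(2) to match the Fisher metric.
    """
    num = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))


# The same coordinate shift in the mean, at two different scales.
d_narrow = fisher_rao_gaussian(0.0, 1.0, 1.0, 1.0)
d_wide = fisher_rao_gaussian(0.0, 10.0, 1.0, 10.0)
print(f"shift of 1 at sigma = 1:  {d_narrow:.3f}")
print(f"shift of 1 at sigma = 10: {d_wide:.3f}")
```

Along a line of constant mean the formula reduces to $\sqrt{2}\,|\ln(\sigma_2/\sigma_1)|$, so halving $\sigma$ costs the same distance whether it goes from 1 to 0.5 or from 0.01 to 0.005. In Euclidean coordinates the second step looks a hundred times smaller.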

The Fisher matrix $\mathcal{I}(\theta) = \mathbb{E}[\nabla\log p \cdot (\nabla\log p)^\top]$ is zero when:

Exponential families: flat geometry and duality

Exponential families $p(x; \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$ are the canonical object of information geometry. Amari showed: these manifolds carry **two** natural coordinate systems. Natural parameters $\theta$ and mean parameters $\eta = \mathbb{E}[T(X)] = \nabla A(\theta)$. The link is the Legendre transform: $A^*(\eta) = \sup_\theta (\theta^\top \eta - A(\theta))$.
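As a worked instance of the two coordinate systems, take the Bernoulli family written with natural parameter $\theta$ (the logit), $T(x) = x$ and $A(\theta) = \log(1 + e^\theta)$. The sketch below uses illustrative function names and checks three things: $\eta = A'(\theta)$ is the sigmoid, the conjugate $A^*(\eta)$ is the negative entropy, and its derivative maps $\eta$ back to $\theta$. A brute-force grid search stands in for the supremum.

```python
import math


def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for the Bernoulli family."""
    return math.log1p(math.exp(theta))


def mean_param(theta):
    """eta = A'(theta): the sigmoid, i.e. E[T(X)] = P(X = 1)."""
    return 1.0 / (1.0 + math.exp(-theta))


def conjugate(eta):
    """A*(eta) in closed form: the negative entropy of Bernoulli(eta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)


def natural_param(eta):
    """theta = (A*)'(eta): the logit, inverse of mean_param."""
    return math.log(eta / (1.0 - eta))


def conjugate_numeric(eta, lo=-10.0, hi=10.0, steps=20_000):
    """Grid approximation of sup_theta (theta * eta - A(theta))."""
    best = -math.inf
    for k in range(steps + 1):
        theta = lo + (hi - lo) * k / steps
        best = max(best, theta * eta - log_partition(theta))
    return best
```

For any $\theta$ with $\eta$ = `mean_param(theta)`, the identity $A(\theta) + A^*(\eta) = \theta\,\eta$ should hold to floating-point precision, and `conjugate_numeric(eta)` should agree with `conjugate(eta)` up to the grid resolution.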

**Dually flat structure (Amari):** each of the two coordinate systems ($\theta$ and $\eta$) induces its own flat affine connection: $\theta$-coordinates are affine for the e-connection (exponential), $\eta$-coordinates for the m-connection (mixture). The two connections are dual to each other with respect to the Fisher metric.

**KL as Bregman divergence:** $\mathrm{KL}(p_\theta \| p_{\theta'}) = A(\theta') - A(\theta) - \nabla A(\theta)^\top(\theta' - \theta)$, which is the Bregman divergence $B_A(\theta', \theta)$ generated by $A$. Note that the arguments appear in swapped order.
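The Bregman identity for KL can be verified numerically. The sketch below again assumes the Bernoulli family in natural (logit) coordinates, with illustrative function names: one function computes KL directly from the probabilities, the other from the log-partition function alone.

```python
import math


def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for the Bernoulli family."""
    return math.log1p(math.exp(theta))


def sigmoid(theta):
    """Gradient of A: the mean parameter eta."""
    return 1.0 / (1.0 + math.exp(-theta))


def kl_direct(theta, theta_prime):
    """KL(p_theta || p_theta') computed from the two success probabilities."""
    p, q = sigmoid(theta), sigmoid(theta_prime)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_bregman(theta, theta_prime):
    """The same KL as A(theta') - A(theta) - A'(theta) * (theta' - theta)."""
    return (
        log_partition(theta_prime)
        - log_partition(theta)
        - sigmoid(theta) * (theta_prime - theta)
    )
```

The two functions should agree for any pair of logits. Swapping the arguments gives a different number in general, which is the asymmetry of KL seen from the Bregman side.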

**In production ML:** Adam preconditions with a diagonal approximation of the empirical Fisher. K-FAC is a block approximation (Kronecker product). TRPO/PPO is a step in the Fisher-Rao metric with a KL constraint. VAE ELBO minimizes $\mathrm{KL}(q_\phi \| p)$ over $q_\phi$: an e-projection (reverse-KL projection) of the posterior onto the variational family. Mirror descent on the simplex gives exponentiated gradient for online learning and bandit algorithms. This is not just theory - it is tooling that modern ML relies on.
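Of the items above, mirror descent on the simplex is the shortest to write down. With negative entropy as the mirror map, the update becomes a multiplicative reweighting followed by normalization. The sketch below assumes linear losses $\langle w, \ell_t \rangle$, so the gradient with respect to $w$ is the loss vector itself; `exponentiated_gradient_step` and `run_hedge` are illustrative names.

```python
import math


def exponentiated_gradient_step(w, grad, lr):
    """Mirror-descent step on the simplex with the negative-entropy mirror map.

    Each weight is multiplied by exp(-lr * grad_i), then the vector is
    renormalized so that it stays a probability distribution.
    """
    unnormalized = [wi * math.exp(-lr * gi) for wi, gi in zip(w, grad)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def run_hedge(losses, lr=0.5):
    """Run exponentiated gradient over a sequence of per-round loss vectors."""
    k = len(losses[0])
    w = [1.0 / k] * k  # start from the uniform distribution
    for loss in losses:
        w = exponentiated_gradient_step(w, loss, lr)
    return w


# Three experts with fixed per-round losses; the second expert is best.
weights = run_hedge([[1.0, 0.2, 0.8]] * 20)
print([round(x, 4) for x in weights])
```

The weights never leave the simplex and no projection step is needed, which is the practical payoff of choosing a geometry that matches the constraint set.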

In the exponential family $p(x; \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$, mean parameters $\eta$ relate to natural parameters $\theta$ as:

Takeaways

  • $\mathcal{M} = \{p(x; \theta)\}$ is a manifold: points are distributions, coordinates are parameters, tangent space consists of score functions $\partial_i \log p$
  • $\mathcal{I}(\theta) = \mathbb{E}[\nabla\log p \cdot (\nabla\log p)^\top]$ - by Chentsov's theorem, the unique metric on a statistical manifold (up to a constant factor) that is invariant under sufficient statistics
  • The space of Gaussians $N(\mu, \sigma)$ is hyperbolic; the simplex is spherical. Euclidean SGD does not see this geometry
  • Natural gradient $\mathcal{I}^{-1} g$: invariant under reparameterization, solves $\min_\Delta L$ subject to a KL constraint. Adam, K-FAC, Shampoo can be read as approximations
  • Exp-families: dual flat structure ($\theta$ and $\eta = \nabla A(\theta)$); KL as Bregman divergence; MLE = m-projection
  • TRPO/PPO, K-FAC, VAE/ELBO, mirror descent - IG in production, not just theory

Where to next

The manifold is set up. Next comes its geometry and applications.

  • Fisher metric — Precise definition as a Riemannian metric. Properties, relation to Cramér-Rao
  • Exponential families — Natural and mean parameters, log-partition function, link to MLE and sufficient statistics
  • KL and Bregman divergences — KL as Bregman divergence from log-partition. Asymmetry, Pythagorean property
  • Natural gradient — The main practical application of IG. Why faster than SGD, which approximations work in production

Questions for reflection

  • In which current ML tasks does the team use KL divergence (VAE, RLHF, distillation)? Do engineers realize they are working with information geometry?
  • If the optimizer were chosen not as 'Adam by default' but from understanding the Fisher metric of the loss surface - what would change in training large models?
  • Which problems in architectures (vanishing gradients, instability near softmax boundaries, mode collapse in GANs) can be reformulated as 'wrong metric on the manifold'?

Related lessons

  • stat-27-graphical-models