Information Geometry
Statistical manifolds: distributions as points of geometry
In 1945, Rao observed that the space of probability distributions is itself a manifold with a natural geometry. The observation waited some 40 years for Amari's work to develop it into a full theory. Today it offers a common lens on Adam, K-FAC, TRPO, and VAEs, among many other tools of modern deep learning.
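- Adam - can be read as a diagonal approximation built from the empirical Fisher: $g/\sqrt{\mathbb{E}[g^2]}$ resembles a diagonal natural gradient, although the exact diagonal form would divide by $\mathbb{E}[g^2]$ rather than its square root
- K-FAC and Shampoo - Kronecker-factored block preconditioners (K-FAC approximates the Fisher block by block; Shampoo is closely related), with a claimed 2-5x speedup in LLM training
- TRPO/PPO - a step measured in the Fisher-Rao metric: TRPO constrains the KL divergence between old and new policies, PPO approximates that constraint with a clipped objective or a KL penalty
- VAE ELBO - maximizing the ELBO minimizes $\mathrm{KL}(q_\phi \| p)$ over $q_\phi$: an e-projection (reverse KL) of the posterior onto the parametric family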
- Mirror descent on simplex: exponentiated gradient for online learning and bandit algorithms
Prerequisites
- Partial derivatives and gradient of a multivariate function
- Expectation: $\mathbb{E}[f(X)] = \int f(x) p(x) dx$
- Log-likelihood and MLE: $\hat\theta = \arg\max \sum \log p(x_i; \theta)$
Distributions as points: the statistical manifold
The idea: a parametric family $\mathcal{M} = \{p(x; \theta) : \theta \in \Theta \subset \mathbb{R}^d\}$ is not just a collection of functions; it is a **manifold**. Each $\theta$ labels a point, namely the distribution $p(\cdot; \theta)$. The parameters serve as coordinates, so the tools of differential geometry apply to this object.
**The tangent vectors** at the point $\theta$ are the partial derivatives $\partial_i \log p(x; \theta)$, which together form the **score function**. Each is a random function of $x$, and they span the tangent space $T_\theta \mathcal{M}$. Intuition: the score is the direction of steepest log-likelihood ascent. When a model is trained on negative log-likelihood, backprop computes exactly its negative, averaged over the batch.
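As a minimal sketch of what the score looks like in the simplest case, take a Bernoulli distribution parameterized directly by its success probability $\theta$ (this choice of family and the function names are illustrative assumptions, not part of the lesson). The score can be written in closed form, and its expectation under $p(x; \theta)$ is zero, which is what lets it act as a tangent vector rather than a shift of total probability.

```python
def score(x: int, theta: float) -> float:
    """d/dtheta log p(x; theta) for Bernoulli p(x; theta) = theta^x (1 - theta)^(1 - x)."""
    return x / theta - (1 - x) / (1 - theta)


def expected_score(theta: float) -> float:
    """Exact expectation of the score over the two outcomes x = 1 and x = 0."""
    return theta * score(1, theta) + (1 - theta) * score(0, theta)


theta = 0.3
# Observing x = 1 pushes theta up, observing x = 0 pushes it down.
print("score at x=1:", score(1, theta))
print("score at x=0:", score(0, theta))
# The two pushes cancel on average (zero up to floating-point rounding).
print("expected score:", expected_score(theta))
```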
**ML insight:** when a neural network is updated via the gradient over $\theta$, it moves through Euclidean parameter space - as if the manifold were flat. It is not flat. A small step in $\theta$ can yield a large shift in the output distribution (and vice versa). Information geometry provides the right metric - one that reflects the difference between distributions, not between coordinates.
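To make the mismatch between coordinates and distributions concrete, here is a small sketch using the closed-form KL divergence between two univariate Gaussians. The step size and the values of $\sigma$ are arbitrary illustrative choices. The same coordinate step in $\mu$ changes the distribution by very different amounts depending on $\sigma$.

```python
import math


def kl_gauss(mu0: float, sigma0: float, mu1: float, sigma1: float) -> float:
    """KL( N(mu0, sigma0^2) || N(mu1, sigma1^2) ), closed form."""
    return (math.log(sigma1 / sigma0)
            + (sigma0 ** 2 + (mu0 - mu1) ** 2) / (2 * sigma1 ** 2)
            - 0.5)


step = 0.1  # identical Euclidean step in the mu coordinate
for sigma in (0.01, 1.0, 100.0):
    kl = kl_gauss(0.0, sigma, step, sigma)
    print(f"sigma={sigma:>6}: KL after mu += {step} is {kl:.6g}")
```

With $\sigma$ fixed, the divergence reduces to $\Delta\mu^2 / (2\sigma^2)$, so a step that is negligible for a wide Gaussian is a large move for a narrow one.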
What is a 'point' on the statistical manifold $\mathcal{M} = \{p(x; \theta)\}$?
Fisher metric: the right distance between distributions
If $\mathcal{M}$ is a manifold, it needs a metric. Rao proposed an inner product on the tangent space via the expected product of score functions: $g_{ij}(\theta) = \mathbb{E}_{p(x;\theta)}[\partial_i \log p \cdot \partial_j \log p]$. In matrix form this is the **Fisher information matrix**: $\mathcal{I}(\theta) = \mathbb{E}[\nabla_\theta \log p \cdot (\nabla_\theta \log p)^\top]$.
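**Three equivalent formulas** (under regularity): $\mathcal{I}(\theta) = \mathbb{E}[\nabla \log p \cdot (\nabla \log p)^\top] = -\mathbb{E}[\nabla^2 \log p] = \mathrm{Cov}(\nabla \log p)$. The last equality holds because the score has zero mean. **Cramér-Rao bound:** $\mathrm{Cov}(\hat\theta) \succeq \mathcal{I}(\theta)^{-1}$ for any unbiased estimator $\hat\theta$ built from one observation (with $n$ i.i.d. observations the bound becomes $(n\,\mathcal{I}(\theta))^{-1}$) - no unbiased estimator has a smaller covariance than the inverse Fisher information. **Natural gradient:** $\theta \leftarrow \theta - \eta\, \mathcal{I}(\theta)^{-1} \nabla L$ - a step in the Fisher-Rao metric, invariant under reparameterization to first order in the step size $\eta$.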
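The identities above can be checked exactly on a one-parameter example. The sketch below uses a Bernoulli distribution (an illustrative choice) and a made-up quadratic loss `L(p) = (p - target)^2`; both the loss and the helper names are assumptions for demonstration. It computes the Fisher information three ways, then takes one gradient step in two different coordinates, the probability $p$ and the logit $z$, to compare how much the result depends on the parameterization.

```python
import math


def fisher_three_ways(p: float):
    """Fisher information of Bernoulli(p) in the p-coordinate, three ways."""
    xs, probs = (1, 0), (p, 1 - p)
    score = [x / p - (1 - x) / (1 - p) for x in xs]          # d/dp log p(x)
    hess = [-x / p ** 2 - (1 - x) / (1 - p) ** 2 for x in xs]  # d^2/dp^2 log p(x)
    mean = sum(w * s for w, s in zip(probs, score))
    outer = sum(w * s * s for w, s in zip(probs, score))     # E[score^2]
    neg_hess = -sum(w * h for w, h in zip(probs, hess))      # -E[Hessian]
    var = outer - mean ** 2                                  # Var(score)
    return outer, neg_hess, var


def loss_grad_p(p: float, target: float = 0.9) -> float:
    """Gradient of the hypothetical loss L(p) = (p - target)^2 with respect to p."""
    return 2 * (p - target)


def step_in_p(p: float, lr: float, natural: bool) -> float:
    """One update carried out in the p-coordinate; returns the new p."""
    g = loss_grad_p(p)
    fisher = 1 / (p * (1 - p))
    return p - lr * (g / fisher if natural else g)


def step_in_logit(p: float, lr: float, natural: bool) -> float:
    """One update carried out in the logit coordinate z; returns the new p."""
    z = math.log(p / (1 - p))
    g = loss_grad_p(p) * p * (1 - p)   # chain rule: dp/dz = p (1 - p)
    fisher = p * (1 - p)               # Fisher information in the z-coordinate
    z_new = z - lr * (g / fisher if natural else g)
    return 1 / (1 + math.exp(-z_new))


print("Fisher, three ways:", fisher_three_ways(0.3))
p0, lr = 0.2, 0.01
print("plain gradient:   p-coords", step_in_p(p0, lr, False),
      " logit-coords", step_in_logit(p0, lr, False))
print("natural gradient: p-coords", step_in_p(p0, lr, True),
      " logit-coords", step_in_logit(p0, lr, True))
```

The plain gradient lands at visibly different distributions depending on the coordinate used, while the two natural-gradient results agree up to a second-order term in the learning rate.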
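**The main surprise for an engineer:** the space of Gaussians $\{N(\mu, \sigma)\}$ with the Fisher metric is not Euclidean but **hyperbolic** (the Lobachevsky plane): $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$. So equal steps in coordinates are not equal steps between distributions: $N(0,1)$ and $N(1,1)$ are about $0.98$ apart in Fisher-Rao distance, while $N(0,10)$ and $N(1,10)$ are only about $0.1$ apart, although $\mu$ shifts by 1 in both cases (the second argument is the standard deviation $\sigma$). The metric blows up as $\sigma \to 0$, which is one reason plain SGD over $\sigma$ can become unstable near that boundary.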
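For univariate Gaussians the Fisher-Rao distance has a closed form, which makes the hyperbolic geometry easy to probe numerically. The sketch below assumes the $(\mu, \sigma)$ parameterization with $\sigma$ the standard deviation and the metric $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$; rescaling $\mu$ by $1/\sqrt{2}$ turns this into twice the Poincaré half-plane metric, whose distance formula is standard. The function name is an illustrative choice.

```python
import math


def fisher_rao_gauss(mu1: float, sigma1: float, mu2: float, sigma2: float) -> float:
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Half-plane distance in coordinates (mu / sqrt(2), sigma), scaled by sqrt(2).
    """
    num = (mu1 - mu2) ** 2 / 2 + (sigma1 - sigma2) ** 2
    return math.sqrt(2) * math.acosh(1 + num / (2 * sigma1 * sigma2))


# The same coordinate shift in mu, measured at different scales sigma.
for sigma in (0.1, 1.0, 10.0):
    d = fisher_rao_gauss(0.0, sigma, 1.0, sigma)
    print(f"sigma={sigma:>5}: distance for mu 0 -> 1 is {d:.4f}")

# Along the sigma axis, distance depends only on the ratio of scales.
print(fisher_rao_gauss(0.0, 1.0, 0.0, 3.0), fisher_rao_gauss(0.0, 10.0, 0.0, 30.0))
```

The shift in $\mu$ costs less and less as $\sigma$ grows, and scaling $\sigma$ by a fixed factor costs the same distance $\sqrt{2}\,|\ln(\sigma_2/\sigma_1)|$ wherever it happens.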
The Fisher matrix $\mathcal{I}(\theta) = \mathbb{E}[\nabla\log p \cdot (\nabla\log p)^\top]$ is zero when:
Exponential families: flat geometry and duality
Exponential families $p(x; \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$ are the canonical object of information geometry. Amari showed: these manifolds carry **two** natural coordinate systems. Natural parameters $\theta$ and mean parameters $\eta = \mathbb{E}[T(X)] = \nabla A(\theta)$. The link is the Legendre transform: $A^*(\eta) = \sup_\theta (\theta^\top \eta - A(\theta))$.
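A concrete instance of the two coordinate systems, sketched for the Bernoulli family (an illustrative choice; the function names are assumptions). Here $T(x) = x$, the natural parameter $\theta$ is the logit, $A(\theta) = \log(1 + e^\theta)$, the mean parameter $\eta = \nabla A(\theta)$ is the sigmoid, and the Legendre dual $A^*(\eta)$ works out to the negative entropy.

```python
import math


def A(theta: float) -> float:
    """Log-partition function of the Bernoulli family: log(1 + e^theta)."""
    return math.log1p(math.exp(theta))


def eta_of_theta(theta: float) -> float:
    """Mean parameter eta = A'(theta), the sigmoid of theta."""
    return 1 / (1 + math.exp(-theta))


def theta_of_eta(eta: float) -> float:
    """Inverse map theta = (A*)'(eta), the logit of eta."""
    return math.log(eta / (1 - eta))


def A_star(eta: float) -> float:
    """Legendre dual of A: the negative entropy of Bernoulli(eta)."""
    return eta * math.log(eta) + (1 - eta) * math.log(1 - eta)


theta = 0.7
eta = eta_of_theta(theta)
h = 1e-6
print("eta from formula:        ", eta)
print("eta from finite diff. A':", (A(theta + h) - A(theta - h)) / (2 * h))
print("round trip theta:        ", theta_of_eta(eta))
# At matching points the Legendre pair satisfies A(theta) + A*(eta) = theta * eta.
print("A + A* - theta*eta:      ", A(theta) + A_star(eta) - theta * eta)
```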
**Dually flat structure (Amari):** each of the two coordinate systems ($\theta$ and $\eta$) induces its own flat affine connection. The $\theta$-coordinates are flat with respect to the e-connection (exponential), the $\eta$-coordinates with respect to the m-connection (mixture), and the two connections are dual to each other with respect to the Fisher metric. **KL as Bregman divergence:** $\mathrm{KL}(p_\theta \| p_{\theta'}) = A(\theta') - A(\theta) - \nabla A(\theta)^\top(\theta' - \theta)$ - the Bregman divergence generated by $A$, evaluated from $\theta$ to $\theta'$.
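The Bregman identity can be verified numerically. The sketch below uses the Poisson family (an illustrative choice), where $\theta = \log \lambda$, $T(x) = x$, and $A(\theta) = e^\theta$, and compares the Bregman expression against a direct, truncated summation of $\sum_x p(x) \log\frac{p(x)}{q(x)}$. The truncation at 200 terms is an assumption that is safe for the small rates used here.

```python
import math


def A(theta: float) -> float:
    """Poisson log-partition function in the natural parameter theta = log(rate)."""
    return math.exp(theta)


def bregman_kl(theta: float, theta_prime: float) -> float:
    """A(theta') - A(theta) - A'(theta) * (theta' - theta)."""
    grad = math.exp(theta)  # A'(theta), which is also the mean parameter
    return A(theta_prime) - A(theta) - grad * (theta_prime - theta)


def direct_kl(rate: float, rate_prime: float, terms: int = 200) -> float:
    """KL(Poisson(rate) || Poisson(rate_prime)) by truncated summation."""
    total = 0.0
    for x in range(terms):
        log_p = x * math.log(rate) - rate - math.lgamma(x + 1)
        log_q = x * math.log(rate_prime) - rate_prime - math.lgamma(x + 1)
        total += math.exp(log_p) * (log_p - log_q)
    return total


rate, rate_prime = 2.0, 5.0
print("Bregman form:", bregman_kl(math.log(rate), math.log(rate_prime)))
print("Direct sum:  ", direct_kl(rate, rate_prime))
# Swapping the arguments gives a different value: KL is not symmetric.
print("Reversed:    ", bregman_kl(math.log(rate_prime), math.log(rate)))
```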
**In production ML:** Adam can be read as a diagonal approximation built from the empirical Fisher (with a square root that the exact natural gradient does not have). K-FAC is a block approximation of the Fisher using Kronecker products. TRPO takes a step in the Fisher-Rao metric under a KL constraint, and PPO approximates that constraint. Maximizing the VAE ELBO minimizes $\mathrm{KL}(q_\phi \| p)$ over $q_\phi$: an e-projection (reverse KL) of the posterior onto the parametric family. Mirror descent on the simplex gives exponentiated gradient for online learning and bandit algorithms. This is not only theory: these are tools that much of modern ML relies on.
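Of the methods listed, exponentiated gradient is the shortest to write down. The sketch below is a minimal version of mirror descent on the probability simplex with the negative-entropy mirror map, applied to a linear loss $\langle w, \ell \rangle$. The loss vector, learning rate, and iteration count are made-up illustrative values. The multiplicative update followed by normalization keeps the iterate on the simplex without any explicit projection.

```python
import math


def exponentiated_gradient_step(w, grad, lr):
    """One mirror-descent step on the simplex: w_i <- w_i * exp(-lr * grad_i), renormalized."""
    unnorm = [wi * math.exp(-lr * gi) for wi, gi in zip(w, grad)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


losses = [0.9, 0.1, 0.5]       # hypothetical fixed per-expert losses
w = [1 / 3, 1 / 3, 1 / 3]      # start from the uniform distribution
for _ in range(50):
    # For the linear loss <w, losses>, the gradient with respect to w is `losses`.
    w = exponentiated_gradient_step(w, losses, lr=0.5)

print("weights:", [round(wi, 6) for wi in w])
print("sum of weights:", sum(w))
```

The weights stay strictly positive and sum to one at every step, and the mass concentrates on the coordinate with the lowest loss.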
In the exponential family $p(x; \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$, mean parameters $\eta$ relate to natural parameters $\theta$ as:
Takeaways
- $\mathcal{M} = \{p(x; \theta)\}$ is a manifold: points are distributions, coordinates are parameters, tangent space consists of score functions $\partial_i \log p$
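- $\mathcal{I}(\theta) = \mathbb{E}[\nabla\log p \cdot (\nabla\log p)^\top]$ - the Fisher metric; by Chentsov's theorem it is, up to a constant factor, the unique Riemannian metric on a statistical manifold that is invariant under sufficient statistics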
- The space of Gaussians $N(\mu, \sigma)$ is hyperbolic; the simplex is spherical. Euclidean SGD does not see this geometry
- Natural gradient $\mathcal{I}^{-1} g$: invariant under reparameterization (to first order in the step size); it is the steepest-descent direction for $L$ under a KL constraint on the change in distribution. Adam, K-FAC, Shampoo can be viewed as approximations
- Exp-families: dually flat structure ($\theta$ and $\eta = \nabla A(\theta)$); KL as Bregman divergence; MLE = m-projection
- TRPO/PPO, K-FAC, VAE/ELBO, mirror descent - IG in production, not just theory
Where to next
The manifold is set up. Next comes its geometry and applications.
- Fisher metric - Precise definition as a Riemannian metric. Properties, relation to Cramér-Rao
- Exponential families - Natural and mean parameters, log-partition function, link to MLE and sufficient statistics
- KL and Bregman divergences - KL as Bregman divergence from log-partition. Asymmetry, Pythagorean property
- Natural gradient - The main practical application of IG. Why it can be faster than SGD, which approximations work in production
Questions for reflection
- In which current ML tasks does the team use KL divergence (VAE, RLHF, distillation)? Do engineers realize they are working with information geometry?
- If the optimizer were chosen not as 'Adam by default' but from understanding the Fisher metric of the loss surface - what would change in training large models?
- Which problems in architectures (vanishing gradients, instability near softmax boundaries, mode collapse in GANs) can be reformulated as 'wrong metric on the manifold'?