Information Geometry

Statistical manifolds: distributions as points of geometry

In 1945, Rao observed that a family of probability distributions is itself a manifold with a natural geometry. The observation waited roughly 40 years for Amari's systematic development. Today it offers a common lens on Adam, K-FAC, TRPO, and VAEs, which together cover much of modern deep learning practice.

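Adam - loosely, a diagonal preconditioner built from the empirical Fisher: $g/\sqrt{\mathbb{E}[g^2]}$ rescales each coordinate by the square root of the gradient's second moment (an exact diagonal natural gradient would divide by $\mathbb{E}[g^2]$ itself)
K-FAC and Shampoo - Kronecker-factored preconditioners: K-FAC approximates the Fisher block by block (one block per layer), and Shampoo uses a closely related structure
TRPO/PPO - policy steps measured in the Fisher-Rao metric: TRPO constrains the KL divergence between old and new policies, PPO approximates that constraint
VAE ELBO - minimize $\mathrm{KL}(q_\phi \| p)$: e-projection (reverse-KL projection) of the posterior onto the variational family
Mirror descent on the simplex - exponentiated gradient for online learning and bandit algorithms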

Prerequisites

Partial derivatives and gradient of a multivariate function
Expectation: $\mathbb{E}[f(X)] = \int f(x) p(x) dx$
Log-likelihood and MLE: $\hat\theta = \arg\max \sum \log p(x_i; \theta)$

Distributions as points: the statistical manifold

The idea: a parametric family $\mathcal{M} = \{p(x; \theta) : \theta \in \Theta \subset \mathbb{R}^d\}$ is not just a collection of functions - it is a **manifold**. Each $\theta$ labels a point, and that point is a distribution. Parameters are coordinates. The tools of differential geometry apply to this object.

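**The tangent vectors** at a point $\theta$ are represented by the partial derivatives $\partial_i \log p(x; \theta)$, the components of the **score function**. Each component is a random variable (a function of $x$) with zero mean under $p(x; \theta)$, and together they span the tangent space $T_\theta \mathcal{M}$. Intuition: the score is the direction of steepest log-likelihood ascent for the observation $x$ - it is what backprop computes when the training loss is a negative log-likelihood.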
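The score can be made concrete for a univariate Gaussian. The sketch below is illustrative only: the function names, the grid, and the parameter values are assumptions, not part of the lesson. It writes out the two score components and checks by numerical integration that each has zero expectation.

```python
import math

def gaussian_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def score(x, mu, sigma):
    """Score components: d/dmu and d/dsigma of log N(x; mu, sigma)."""
    z = (x - mu) / sigma
    return (z / sigma, (z * z - 1.0) / sigma)

def expected_score(mu, sigma, half_width=10.0, steps=20001):
    """E[score] under N(mu, sigma) via the trapezoid rule on mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / (steps - 1)
    total = [0.0, 0.0]
    for k in range(steps):
        x = lo + k * h
        w = 0.5 if k in (0, steps - 1) else 1.0   # trapezoid end-point weights
        p = gaussian_pdf(x, mu, sigma)
        s_mu, s_sigma = score(x, mu, sigma)
        total[0] += w * p * s_mu * h
        total[1] += w * p * s_sigma * h
    return tuple(total)

# Hypothetical parameter values chosen for illustration.
mean_score = expected_score(mu=1.5, sigma=0.7)
print(mean_score)  # both components should be numerically close to zero
```

Zero mean is what makes the covariance of the score equal to its second moment, which the Fisher metric relies on.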

**ML insight:** when a neural network is updated via the gradient over $\theta$, it moves through Euclidean parameter space - as if the manifold were flat. It is not flat. A small step in $\theta$ can yield a large shift in the output distribution (and vice versa). Information geometry provides the right metric - one that reflects the difference between distributions, not between coordinates.

What is a 'point' on the statistical manifold $\mathcal{M} = \{p(x; \theta)\}$?

Fisher metric: the right distance between distributions

If $\mathcal{M}$ is a manifold, it needs a metric. Rao proposed an inner product on the tangent space via the expected product of score functions: $g_{ij}(\theta) = \mathbb{E}_{p(x;\theta)}[\partial_i \log p \cdot \partial_j \log p]$. In matrix form this is the **Fisher information matrix**: $\mathcal{I}(\theta) = \mathbb{E}[\nabla_\theta \log p \cdot (\nabla_\theta \log p)^\top]$.
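For a discrete distribution the expectation in $g_{ij}$ is a finite sum, so the Fisher matrix can be computed exactly. The following is a minimal sketch, assuming a categorical distribution parameterized by softmax logits (a choice made here for illustration, with hypothetical logit values); it builds the matrix from the expected outer product of scores.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def fisher_categorical(logits):
    """Exact Fisher matrix g_ij = E[d_i log p * d_j log p] in softmax-logit coordinates."""
    probs = softmax(logits)
    d = len(logits)
    fisher = [[0.0] * d for _ in range(d)]
    for k, pk in enumerate(probs):       # expectation over outcomes x = k
        # Score for outcome k: gradient of log softmax_k is one_hot(k) - probs.
        s = [(1.0 if i == k else 0.0) - probs[i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                fisher[i][j] += pk * s[i] * s[j]
    return fisher, probs

fisher, probs = fisher_categorical([0.2, -1.0, 0.5])   # hypothetical logits
for row in fisher:
    print([round(v, 4) for v in row])
```

The result equals $\mathrm{diag}(\pi) - \pi\pi^\top$, and every row sums to zero: adding a constant to all logits leaves the distribution unchanged, so the metric assigns that direction zero length.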

**Three equivalent formulas** (under regularity): $\mathcal{I}(\theta) = \mathbb{E}[\nabla \log p \cdot (\nabla \log p)^\top] = -\mathbb{E}[\nabla^2 \log p] = \mathrm{Cov}(\nabla \log p)$. The last equality holds because the score has zero mean. **Cramér-Rao bound:** $\mathrm{Cov}(\hat\theta) \succeq \mathcal{I}^{-1}$ for any unbiased estimator $\hat\theta$ - no unbiased estimator is more accurate than the inverse Fisher. **Natural gradient:** $\theta \leftarrow \theta - \eta \mathcal{I}^{-1} \nabla L$ - a steepest-descent step in the Fisher-Rao metric, invariant under reparameterization to first order in the step size $\eta$.
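The reparameterization property of the natural gradient can be checked on a one-parameter example. The sketch below assumes a Bernoulli model written in two coordinates, the probability $p$ and the logit, with loss $L = \mathrm{KL}(\mathrm{Bern}(q) \| \mathrm{Bern}(p))$; the starting point, target, and learning rate are hypothetical. It takes one step in each coordinate system and compares the resulting distributions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    return math.log(p / (1.0 - p))

def step_in_p(p, q, lr, natural):
    """One update in the coordinate p; returns the new probability."""
    grad = (p - q) / (p * (1.0 - p))     # dL/dp for L = KL(Bern(q) || Bern(p))
    fisher = 1.0 / (p * (1.0 - p))       # Fisher information in the p coordinate
    direction = grad / fisher if natural else grad
    return p - lr * direction

def step_in_logit(p, q, lr, natural):
    """One update in the logit coordinate; returns the new probability."""
    grad = p - q                         # dL/dtheta by the chain rule
    fisher = p * (1.0 - p)               # Fisher information in the logit coordinate
    direction = grad / fisher if natural else grad
    return sigmoid(logit(p) - lr * direction)

p0, q, lr = 0.2, 0.6, 0.01               # hypothetical start, target, step size
nat_gap = abs(step_in_p(p0, q, lr, True) - step_in_logit(p0, q, lr, True))
plain_gap = abs(step_in_p(p0, q, lr, False) - step_in_logit(p0, q, lr, False))
print(nat_gap, plain_gap)
```

The two natural-gradient updates land on nearly the same distribution and differ only by terms of second order in the step size, while the two plain-gradient updates disagree already at first order.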

**The main surprise for an engineer:** the space of univariate Gaussians $\{N(\mu, \sigma)\}$ with the Fisher metric is not Euclidean but **hyperbolic** (the Lobachevsky plane): $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$. Equal coordinate steps are therefore not equal distances: moving from $N(0,1)$ to $N(1,1)$ covers a Fisher-Rao distance of about $0.98$, while the same unit shift in $\mu$ from $N(0,10)$ to $N(1,10)$ covers only about $0.10$. The metric blows up as $\sigma \to 0$, which is one reason plain SGD over $\sigma$ tends to be unstable near that boundary.
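The hyperbolic geometry gives a closed-form Fisher-Rao distance between univariate Gaussians. The sketch below assumes the $(\mu, \sigma)$ parameterization with metric $ds^2 = (d\mu^2 + 2\,d\sigma^2)/\sigma^2$, which maps to the Poincaré half-plane via $x = \mu/\sqrt{2}$, $y = \sigma$; the function name and test points are illustrative.

```python
import math

def fisher_rao_gaussian(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1) and N(mu2, sigma2), sigma = std."""
    # Half-plane distance with x = mu / sqrt(2), y = sigma, scaled by sqrt(2).
    num = 0.5 * (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))

# The same unit shift in mu, at two different scales.
d_narrow = fisher_rao_gaussian(0.0, 1.0, 1.0, 1.0)
d_wide = fisher_rao_gaussian(0.0, 10.0, 1.0, 10.0)
print(d_narrow, d_wide)
```

A unit shift of the mean is a long way when the distributions are narrow and a short way when they are wide, which is what "distance between distributions, not between coordinates" means in practice.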

The Fisher matrix $\mathcal{I}(\theta) = \mathbb{E}[\nabla\log p \cdot (\nabla\log p)^\top]$ is zero when:

Exponential families: flat geometry and duality

Exponential families $p(x; \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$ are the canonical object of information geometry. Amari showed that these manifolds carry **two** natural coordinate systems: natural parameters $\theta$ and mean parameters $\eta = \mathbb{E}[T(X)] = \nabla A(\theta)$. The link between them is the Legendre transform of the log-partition function: $A^*(\eta) = \sup_\theta (\theta^\top \eta - A(\theta))$, with $\theta = \nabla A^*(\eta)$ as the inverse map.
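The two coordinate systems can be seen on the smallest example. This sketch assumes the Bernoulli family in exponential form, where $T(x) = x$, $A(\theta) = \log(1 + e^\theta)$, the mean parameter is the sigmoid of $\theta$, and the conjugate $A^*(\eta)$ is the negative entropy; the function names and the value of $\theta$ are illustrative.

```python
import math

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)), written in a numerically stable form."""
    return max(theta, 0.0) + math.log1p(math.exp(-abs(theta)))

def mean_param(theta):
    """eta = dA/dtheta = E[T(X)], the sigmoid of theta."""
    return 1.0 / (1.0 + math.exp(-theta))

def conjugate(eta):
    """A*(eta) = eta log eta + (1 - eta) log(1 - eta), the negative entropy."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

theta = 0.8                                # hypothetical natural parameter
eta = mean_param(theta)

# eta really is the gradient of A: compare with a central finite difference.
h = 1e-6
numeric_grad = (log_partition(theta + h) - log_partition(theta - h)) / (2.0 * h)

# At the optimum of the Legendre transform, A*(eta) = theta * eta - A(theta).
legendre_gap = conjugate(eta) - (theta * eta - log_partition(theta))

# The inverse map theta = dA*/deta is the logit.
theta_back = math.log(eta / (1.0 - eta))
print(eta, numeric_grad, legendre_gap, theta_back)
```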

**Dually flat structure (Amari):** each of the two coordinate systems ($\theta$ and $\eta$) induces its own flat affine connection. $\theta$-coordinates are flat in the e-connection (exponential), $\eta$-coordinates in the m-connection (mixture). **KL as Bregman divergence:** $\mathrm{KL}(p_\theta \| p_{\theta'}) = A(\theta') - A(\theta) - \nabla A(\theta)^\top(\theta' - \theta)$ - the Bregman divergence generated by $A$, with the arguments in swapped order: $B_A(\theta', \theta)$.
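The Bregman identity is easy to verify numerically. The sketch below assumes the Poisson family, where $\theta = \log \lambda$ and $A(\theta) = e^\theta$; the rates and the truncation of the sum are illustrative choices. It compares the KL divergence computed directly from the probability mass functions with the Bregman expression.

```python
import math

def poisson_log_pmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def kl_direct(lam_p, lam_q, terms=200):
    """KL(Pois(lam_p) || Pois(lam_q)) by summing over k (truncated; tail is negligible here)."""
    total = 0.0
    for k in range(terms):
        log_p = poisson_log_pmf(k, lam_p)
        total += math.exp(log_p) * (log_p - poisson_log_pmf(k, lam_q))
    return total

def bregman_from_log_partition(theta, theta_prime):
    """A(theta') - A(theta) - A'(theta) * (theta' - theta) with A = exp, so A' = exp too."""
    a = math.exp
    return a(theta_prime) - a(theta) - a(theta) * (theta_prime - theta)

lam_p, lam_q = 3.0, 5.0                    # hypothetical rates
kl_sum = kl_direct(lam_p, lam_q)
kl_bregman = bregman_from_log_partition(math.log(lam_p), math.log(lam_q))
print(kl_sum, kl_bregman)
```

Both numbers agree with the textbook closed form $\lambda_p \log(\lambda_p/\lambda_q) - \lambda_p + \lambda_q$.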

**In production ML:** Adam can be read as a diagonal approximation built on the empirical Fisher. K-FAC is a block approximation of the Fisher (Kronecker-factored per layer). TRPO/PPO take steps in the Fisher-Rao metric, enforced through a KL constraint. The VAE ELBO minimizes $\mathrm{KL}(q_\phi \| p)$: an e-projection (reverse-KL projection) of the posterior onto the variational family. Mirror descent on the simplex yields exponentiated gradient for online learning and bandit algorithms. This is not only theory: it is tooling that much of modern ML depends on.
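Mirror descent on the simplex, the last item above, fits in a few lines. This is a minimal sketch of exponentiated gradient, assuming negative entropy as the mirror map and a fixed linear loss over three hypothetical "experts"; the cost vector, learning rate, and step count are made up for illustration.

```python
import math

def exponentiated_gradient_step(weights, grad, lr):
    """One mirror-descent step on the simplex: w_i <- w_i * exp(-lr * g_i), renormalized."""
    logits = [math.log(w) - lr * g for w, g in zip(weights, grad)]
    m = max(logits)                       # subtract the max for numerical stability
    unnorm = [math.exp(v - m) for v in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]

costs = [0.9, 0.2, 0.5]                   # hypothetical per-expert losses (gradient of a linear loss)
weights = [1.0 / 3.0] * 3                 # start at the uniform distribution
for _ in range(50):
    weights = exponentiated_gradient_step(weights, costs, lr=0.5)
print([round(w, 4) for w in weights])
```

The update is additive in the log-weights, so the iterates never leave the simplex and no projection step is needed; the mass concentrates on the lowest-cost coordinate.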

In the exponential family $p(x; \theta) = h(x) \exp(\theta^\top T(x) - A(\theta))$, mean parameters $\eta$ relate to natural parameters $\theta$ as:

Takeaways

$\mathcal{M} = \{p(x; \theta)\}$ is a manifold: points are distributions, coordinates are parameters, tangent space consists of score functions $\partial_i \log p$
$\mathcal{I}(\theta) = \mathbb{E}[\nabla\log p \cdot (\nabla\log p)^\top]$ - by Chentsov's theorem, the unique metric on a statistical manifold (up to a constant factor) that is invariant under sufficient statistics
The space of Gaussians $N(\mu, \sigma)$ is hyperbolic; the simplex is spherical. Euclidean SGD does not see this geometry
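Natural gradient $\mathcal{I}^{-1} g$: invariant under reparameterization to first order in the step size, solves the linearized $\min_\Delta L(\theta + \Delta)$ subject to a KL constraint. Adam, K-FAC, and Shampoo can be read as approximations of varying fidelity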
Exp-families: dual flat structure ($\theta$ and $\eta = \nabla A(\theta)$); KL as Bregman divergence; MLE = m-projection
TRPO/PPO, K-FAC, VAE/ELBO, mirror descent - IG in production, not just theory

Where to next

The manifold is set up. Next comes its geometry and applications.

Fisher metric — Precise definition as a Riemannian metric. Properties, relation to Cramér-Rao
Exponential families — Natural and mean parameters, log-partition function, link to MLE and sufficient statistics
KL and Bregman divergences — KL as Bregman divergence from log-partition. Asymmetry, Pythagorean property
Natural gradient — The main practical application of IG. Why faster than SGD, which approximations work in production

Questions for reflection

In which current ML tasks does the team use KL divergence (VAE, RLHF, distillation)? Do engineers realize they are working with information geometry?
If the optimizer were chosen not as 'Adam by default' but from understanding the Fisher metric of the loss surface - what would change in training large models?
Which problems in architectures (vanishing gradients, instability near softmax boundaries, mode collapse in GANs) can be reformulated as 'wrong metric on the manifold'?

Related lessons

stat-27-graphical-models
