Information Geometry

Statistical Manifolds and Fisher Information

Information geometry underlies the natural gradient and its practical approximation K-FAC, which Google Brain has reportedly used to train ResNet-50 about 4x faster than with SGD. The key insight is to treat the space of probability distributions as a Riemannian manifold whose metric is the Fisher information.

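K-FAC (Martens & Grosse, 2015): a Kronecker-factored approximation to the natural gradient, the method behind the reported 4x ResNet-50 speedup
Amari (1998): natural gradient is asymptotically Fisher-efficient and less prone to stalling on plateaus; its local convergence rate is roughly independent of the condition number kappa, whereas plain gradient descent needs O(kappa) steps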
Geodesics on the statistical manifold are optimal parameter update paths

Statistical Manifold

The 4x ResNet-50 speedup attributed to K-FAC came without a new architecture, just better geometry. The family of normal distributions N(mu, sigma^2) is not a flat plane but a 2D Riemannian manifold; under the Fisher metric it has constant negative curvature. The shortest path from N(0,1) to N(5,4) is therefore not a straight line in (mu, sigma) coordinates but a geodesic on this curved space.
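To make the N(0,1) to N(5,4) example concrete, the sketch below compares the Fisher-Rao geodesic distance with the naive straight-line distance in (mu, sigma) coordinates. It assumes the standard Fisher metric for the univariate normal family, ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2, and uses the closed-form distance of the resulting hyperbolic geometry. The function names are illustrative and not part of the lesson.

```python
import math

def fisher_rao_normal(mu1, sigma1, mu2, sigma2):
    """Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2).

    Assumes the metric ds^2 = (dmu^2 + 2 dsigma^2) / sigma^2, which is a
    rescaled hyperbolic half-plane metric with a closed-form distance.
    """
    num = (mu1 - mu2) ** 2 / 2.0 + (sigma1 - sigma2) ** 2
    return math.sqrt(2.0) * math.acosh(1.0 + num / (2.0 * sigma1 * sigma2))

def euclidean(mu1, sigma1, mu2, sigma2):
    """Straight-line distance in (mu, sigma) coordinates, ignoring curvature."""
    return math.hypot(mu1 - mu2, sigma1 - sigma2)

# N(0, 1) has sigma = 1; N(5, 4) has variance 4, so sigma = 2.
d_geo = fisher_rao_normal(0.0, 1.0, 5.0, 2.0)
d_flat = euclidean(0.0, 1.0, 5.0, 2.0)
print(f"Fisher-Rao: {d_geo:.4f}   flat (mu, sigma): {d_flat:.4f}")
```

The two numbers differ because the metric shrinks distances where sigma is large: the geodesic bends toward higher sigma, where moving the mean is cheap.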

The Fisher matrix is the covariance of the score, so it is always positive semi-definite. It defines a valid Riemannian metric only where it is strictly positive definite (a regular model), which requires the parameters to be locally identifiable.
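The difference between semi-definite and strictly positive definite can be checked numerically. The sketch below estimates I(theta) = E[s s^T] by Monte Carlo for two toy models chosen for illustration: the identifiable family N(mu, sigma^2), and the deliberately non-identifiable family N(a + b, 1), where only the sum a + b affects the distribution. Sample sizes and parameter values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_from_scores(scores):
    """Monte Carlo estimate of I(theta) = E[s s^T] from an (n, d) array of scores."""
    return scores.T @ scores / scores.shape[0]

# Identifiable model: N(mu, sigma^2) with theta = (mu, sigma).
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)
scores = np.stack(
    [(x - mu) / sigma**2,                        # d log p / d mu
     ((x - mu) ** 2 - sigma**2) / sigma**3],     # d log p / d sigma
    axis=1,
)
I_regular = fisher_from_scores(scores)   # analytic value: diag(1/sigma^2, 2/sigma^2)

# Non-identifiable model: N(a + b, 1) with theta = (a, b).
a, b = 0.5, -0.5
y = rng.normal(a + b, 1.0, size=200_000)
s = y - (a + b)                          # same score for a and for b
I_singular = fisher_from_scores(np.stack([s, s], axis=1))

print("eigenvalues, identifiable model:    ", np.linalg.eigvalsh(I_regular))
print("eigenvalues, non-identifiable model:", np.linalg.eigvalsh(I_singular))
```

The second model has one eigenvalue that is zero up to rounding: moving along the direction (1, -1) in (a, b) does not change the distribution, so the metric assigns that direction zero length.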

What does the Fisher information matrix I(theta) measure on a statistical manifold?

Natural Gradient: Steepest Descent in the Fisher Metric

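The ordinary gradient nabla L depends on the parametrization: switching from theta to phi = 2*theta halves the gradient (by the chain rule, dL/dphi = (1/2) dL/dtheta) even though the underlying distributions are unchanged. The natural gradient I(theta)^{-1} nabla L is invariant to reparametrization: the Fisher matrix rescales in a compensating way, so the update describes the same move in the space of distributions whichever coordinates are used (exactly for linear changes of coordinates, to first order in the step size otherwise).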
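The phi = 2*theta example can be checked with a few lines of arithmetic. The sketch below takes one descent step in theta and one in phi, then maps the phi result back to theta for comparison. The model is an assumption made for illustration: an Exponential(rate = theta) likelihood with sample mean xbar, for which I(theta) = 1/theta^2. All function names are hypothetical.

```python
def grad_theta(theta, xbar):
    """Gradient of the average negative log-likelihood of Exponential(rate=theta)."""
    return -1.0 / theta + xbar

def fisher_theta(theta):
    """Fisher information of Exponential(rate=theta)."""
    return 1.0 / theta**2

def step_in_theta(theta, xbar, lr, natural):
    g = grad_theta(theta, xbar)
    if natural:
        g = g / fisher_theta(theta)
    return theta - lr * g

def step_in_phi(theta, xbar, lr, natural):
    """Same step carried out in phi = 2 * theta, then mapped back to theta."""
    phi = 2.0 * theta
    dtheta_dphi = 0.5
    g = grad_theta(phi / 2.0, xbar) * dtheta_dphi               # chain rule
    if natural:
        fisher_phi = fisher_theta(phi / 2.0) * dtheta_dphi**2   # metric transforms as a tensor
        g = g / fisher_phi
    return (phi - lr * g) / 2.0

theta0, xbar, lr = 2.0, 1.0, 0.1
plain = (step_in_theta(theta0, xbar, lr, False), step_in_phi(theta0, xbar, lr, False))
natural = (step_in_theta(theta0, xbar, lr, True), step_in_phi(theta0, xbar, lr, True))
print("ordinary gradient, theta vs phi coordinates:", plain)
print("natural gradient,  theta vs phi coordinates:", natural)
```

The ordinary gradient lands on different distributions depending on the coordinates, while the natural gradient lands on the same one. The agreement is exact here only because the change of coordinates is linear.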

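TRPO in reinforcement learning is closely tied to the natural gradient: its trust region bounds the KL divergence between successive policies by epsilon, and solving that constrained problem to second order in the KL yields a natural gradient step whose length is set by epsilon. PPO replaces the explicit KL constraint with a clipped surrogate objective, a cheaper first-order approximation to the same trust-region idea.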
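The link between a KL trust region and the natural gradient can be written out in one step. The derivation below is a sketch under two standard assumptions: the objective is linearized around theta with gradient g, and the KL divergence is replaced by its second-order expansion, whose Hessian is the Fisher matrix I(theta). The symbols g, delta, and lambda are introduced here for illustration.

```latex
\begin{aligned}
&\max_{\delta}\; g^\top \delta
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta+\delta}\right)
\approx \tfrac{1}{2}\,\delta^\top I(\theta)\,\delta \;\le\; \epsilon \\[4pt]
&\text{Stationarity of the Lagrangian: }\;
g - \lambda\, I(\theta)\,\delta = 0
\;\;\Longrightarrow\;\;
\delta = \tfrac{1}{\lambda}\, I(\theta)^{-1} g \\[4pt]
&\text{Active constraint: }\;
\tfrac{1}{2\lambda^{2}}\, g^\top I(\theta)^{-1} g = \epsilon
\;\;\Longrightarrow\;\;
\delta^\star = \sqrt{\frac{2\epsilon}{g^\top I(\theta)^{-1} g}}\;\, I(\theta)^{-1} g
\end{aligned}
```

The direction of the optimal step is the natural gradient I(theta)^{-1} g; the bound epsilon only fixes its length.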

Why is the natural gradient invariant to reparametrization while the ordinary gradient is not?

Exponential Families as Statistical Manifolds

Exponential families (Gaussian, Bernoulli, Poisson, Dirichlet) are the canonical and best-understood class of statistical manifolds. Their geometry is special: they are flat in two dual senses, e-flat in the natural parameters theta and m-flat in the moment parameters eta = E[T(x)], which gives Amari's dually flat structure.

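The two coordinate systems are linked by a Legendre transform: eta = nabla A(theta), and theta = nabla A*(eta), where A* is the convex conjugate of A. This is Amari's e/m duality. It underlies the geometric view of the EM algorithm as alternating projections (Amari's em algorithm): the E-step corresponds to an e-projection onto the data manifold, and the M-step to an m-projection onto the model manifold.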
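The dual coordinates are easiest to see in the Bernoulli family, used here as an illustrative choice. Its natural parameter is the logit, its log-partition function is A(theta) = log(1 + e^theta), and its moment parameter is the success probability. The sketch below checks the Legendre identity A(theta) + A*(eta) = theta * eta and compares a finite-difference second derivative of A with the variance eta(1 - eta); the step size h is an arbitrary assumption.

```python
import math

def A(theta):
    """Log-partition function of the Bernoulli family in natural (logit) coordinates."""
    return math.log1p(math.exp(theta))

def eta_from_theta(theta):
    """Moment parameter eta = E[T(x)] = A'(theta), the sigmoid."""
    return 1.0 / (1.0 + math.exp(-theta))

def theta_from_eta(eta):
    """Inverse map theta = (A*)'(eta), the logit."""
    return math.log(eta / (1.0 - eta))

def A_star(eta):
    """Convex conjugate of A: the negative entropy of Bernoulli(eta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

theta = 0.7
eta = eta_from_theta(theta)

# Legendre identity linking the e-coordinates and the m-coordinates.
legendre_gap = A(theta) + A_star(eta) - theta * eta

# Fisher information as the second derivative of A (central differences).
h = 1e-4
fisher_fd = (A(theta + h) - 2.0 * A(theta) + A(theta - h)) / h**2
fisher_exact = eta * (1.0 - eta)   # Var[x] for Bernoulli(eta)

print("round trip theta:", theta_from_eta(eta))
print("Legendre gap:", legendre_gap)
print("Fisher (finite difference vs exact):", fisher_fd, fisher_exact)
```

The same pattern holds for any exponential family: the gradient of A gives the moment coordinates and its Hessian gives the Fisher metric.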

In an exponential family p(x;theta)=h(x)exp(theta*T(x)-A(theta)), what is the Fisher matrix I(theta)?

Summary

Statistical manifold: M = {p(x;theta) | theta in Theta} with metric g_{ij}(theta) = I_{ij}(theta)
Score function: s(theta) = d log p/d theta; E[s]=0; I(theta)=E[ss^T]=-E[d^2 log p/d theta^2]
Natural gradient: theta <- theta - alpha * I(theta)^{-1} grad L; invariant to reparametrization
Exponential family: p(x;theta)=h(x)exp(theta*T(x)-A(theta)); I(theta)=nabla^2 A(theta)

Questions for Reflection

Why does the Fisher matrix provide a valid Riemannian metric on the space of distributions?
In what sense is the natural gradient invariant to reparametrization while the ordinary gradient is not?
How do geodesics on the statistical manifold relate to optimal parameter updates?

Related Lessons

ig-07-natural-gradient — ig-18 provides the geometric foundation for natural gradient
ig-03-exp-family — Exponential families are the canonical example of statistical manifolds