Information Geometry
Statistical Manifolds and Fisher Information
Natural-gradient methods such as K-FAC put information geometry to work in deep learning, with a reported 4x speedup of ResNet-50 training over SGD at Google Brain. The key insight is to treat a family of probability distributions as a Riemannian manifold whose metric is the Fisher information.
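- K-FAC (Martens & Grosse, 2015): a Kronecker-factored approximation to the Fisher matrix that makes natural gradient tractable for deep networks; the method behind the reported 4x ResNet-50 speedup
- Amari (1998): natural gradient is asymptotically Fisher-efficient and helps escape plateaus; near an optimum of a log-likelihood loss it behaves like a Newton step, so its convergence does not depend on the condition number kappa, whereas gradient descent needs O(kappa) iterations
- Geodesics are the shortest paths between distributions under the Fisher metric; a natural gradient step follows one to first order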
Statistical Manifold
The reported 4x ResNet-50 speedup required no new architecture, just better geometry. The space of all normal distributions N(mu, sigma^2) is not a flat plane but a curved 2D Riemannian manifold: under the Fisher metric ds^2 = (dmu^2 + 2 dsigma^2)/sigma^2 it has constant negative curvature. The shortest path from N(0,1) to N(5,4) is therefore not a straight line in (mu, sigma) coordinates but a geodesic on this curved space.
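The gap between the straight line and the geodesic can be checked numerically. The sketch below assumes the Fisher metric ds^2 = (dmu^2 + 2 dsigma^2)/sigma^2 for the normal family in (mu, sigma) coordinates and uses the standard closed form for the resulting hyperbolic distance. The function names are illustrative, and N(5,4) is read as variance 4, so sigma = 2.

```python
import numpy as np

def fisher_rao_normal(mu1, sigma1, mu2, sigma2):
    """Closed-form Fisher-Rao distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    # Rescaling mu by 1/sqrt(2) turns the Fisher metric into 2x the Poincare
    # half-plane metric, whose distance is an arccosh expression.
    num = 0.5 * (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2
    return np.sqrt(2.0) * np.arccosh(1.0 + num / (2.0 * sigma1 * sigma2))

def straight_line_length(mu1, sigma1, mu2, sigma2, n=20_000):
    """Fisher length of the straight segment in (mu, sigma) coordinates."""
    t = np.linspace(0.0, 1.0, n)
    sigma = sigma1 + t * (sigma2 - sigma1)
    dmu, dsigma = mu2 - mu1, sigma2 - sigma1
    # Speed of the curve measured in the Fisher metric at each point.
    speed = np.sqrt(dmu ** 2 + 2.0 * dsigma ** 2) / sigma
    # Trapezoidal rule written out by hand.
    return float(np.sum(0.5 * (speed[1:] + speed[:-1]) * np.diff(t)))

geodesic = fisher_rao_normal(0.0, 1.0, 5.0, 2.0)
straight = straight_line_length(0.0, 1.0, 5.0, 2.0)
print(f"geodesic length:      {geodesic:.4f}")
print(f"straight-line length: {straight:.4f}")
```

The straight segment comes out longer than the geodesic, which is what it means for the coordinate line not to be the shortest path.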
The Fisher matrix is always positive semi-definite, because it is the covariance of the score. It defines a valid Riemannian metric only where it is strictly positive definite (a regular manifold), which requires the model to be locally identifiable.
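A small Monte Carlo experiment illustrates both halves of this statement. The sketch estimates the Fisher matrix as the average outer product of score vectors, first for N(mu, sigma^2) in (mu, sigma) coordinates, then for a deliberately non-identifiable toy model N(a + b, 1) in which only the sum a + b affects the distribution. The toy model and all names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_from_scores(scores):
    """Monte Carlo Fisher estimate: average outer product of score vectors."""
    return scores.T @ scores / scores.shape[0]

# Identifiable model: N(mu, sigma^2) with parameters (mu, sigma).
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)
scores = np.stack(
    [(x - mu) / sigma**2,                        # d log p / d mu
     -1.0 / sigma + (x - mu) ** 2 / sigma**3],   # d log p / d sigma
    axis=1,
)
fisher_normal = fisher_from_scores(scores)
fisher_analytic = np.diag([1.0 / sigma**2, 2.0 / sigma**2])

# Non-identifiable toy model: N(a + b, 1) with parameters (a, b).
a, b = 0.3, 0.7
y = rng.normal(a + b, 1.0, size=200_000)
s = y - (a + b)
# Both partial derivatives of log p are the same, so the scores are collinear.
fisher_ab = fisher_from_scores(np.stack([s, s], axis=1))
eig_ab = np.linalg.eigvalsh(fisher_ab)

print("estimated Fisher, normal family:\n", fisher_normal)
print("analytic Fisher, normal family:\n", fisher_analytic)
print("eigenvalues for N(a + b, 1):", eig_ab)
```

The second model has a zero eigenvalue along the direction (1, -1): moving along it does not change the distribution, so the Fisher matrix cannot serve as a metric there.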
What does the Fisher information matrix I(theta) measure on a statistical manifold?
Natural Gradient: Steepest Descent in the Fisher Metric
The ordinary gradient nabla L depends on parametrization: switching from theta to phi = 2*theta halves the gradient (dL/dphi = (1/2) dL/dtheta) even though the underlying distributions are unchanged. The natural gradient I(theta)^{-1} nabla L is invariant to reparametrization (exactly in the limit of small steps) because the Fisher matrix rescales in the compensating way: it measures distance between distributions, not between coordinates.
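The phi = 2*theta example can be worked through numerically. The sketch below assumes a one-parameter model N(theta, sigma^2) with known sigma and an average negative log-likelihood loss, so that I(theta) = 1/sigma^2. The specific numbers (sigma, the sample mean xbar, the step size alpha) are arbitrary illustrative choices.

```python
sigma, xbar, alpha = 3.0, 1.5, 0.1

def grad_theta(theta):
    """Gradient of the average negative log-likelihood of N(theta, sigma^2)."""
    return (theta - xbar) / sigma**2

fisher_theta = 1.0 / sigma**2

# Reparametrize with phi = 2 * theta, i.e. theta = phi / 2.
def grad_phi(phi):
    # Chain rule: dL/dphi = dL/dtheta * dtheta/dphi = 0.5 * dL/dtheta.
    return 0.5 * grad_theta(phi / 2.0)

# The Fisher information picks up (dtheta/dphi)^2 = 1/4.
fisher_phi = fisher_theta / 4.0

theta0 = 4.0
phi0 = 2.0 * theta0

# One ordinary gradient step in each parametrization, compared in theta.
theta_gd = theta0 - alpha * grad_theta(theta0)
theta_gd_via_phi = (phi0 - alpha * grad_phi(phi0)) / 2.0

# One natural gradient step in each parametrization, compared in theta.
theta_ng = theta0 - alpha * grad_theta(theta0) / fisher_theta
theta_ng_via_phi = (phi0 - alpha * grad_phi(phi0) / fisher_phi) / 2.0

print("ordinary gradient:", theta_gd, "vs", theta_gd_via_phi)
print("natural gradient: ", theta_ng, "vs", theta_ng_via_phi)
```

The two ordinary steps land on different distributions, while the two natural gradient steps coincide. For a linear change of coordinates like this one the agreement is exact; for nonlinear reparametrizations it holds to first order in the step size.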
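TRPO in reinforcement learning is a direct application of the natural gradient: it optimizes a surrogate objective subject to a trust region KL <= epsilon, and to second order in the step this yields a natural gradient update whose length is set by epsilon. PPO approximates the same trust region with a clipped objective or a KL penalty instead of computing the natural gradient explicitly.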
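The link between a KL bound and a natural gradient step can be made concrete. Approximating the KL divergence by (1/2) d^T F d and the loss by its linearization, the best step d under the constraint (1/2) d^T F d <= epsilon points along F^{-1} g and is scaled to sit on the constraint boundary. The sketch below is a minimal dense-matrix illustration of that formula for loss minimization; it is not TRPO itself, which uses conjugate gradient and a line search, and the function name is hypothetical.

```python
import numpy as np

def kl_constrained_step(grad, fisher, epsilon):
    """Step minimizing grad @ d subject to 0.5 * d @ fisher @ d <= epsilon."""
    nat_grad = np.linalg.solve(fisher, grad)            # F^{-1} g
    scale = np.sqrt(2.0 * epsilon / (grad @ nat_grad))  # puts d on the boundary
    return -scale * nat_grad

# Illustrative symmetric positive definite Fisher matrix and gradient.
fisher = np.array([[2.0, 0.5],
                   [0.5, 1.0]])
grad = np.array([1.0, -3.0])
epsilon = 0.01

step = kl_constrained_step(grad, fisher, epsilon)
approx_kl = 0.5 * step @ fisher @ step
print("step:", step)
print("quadratic KL estimate:", approx_kl)
```

The quadratic KL estimate of the returned step equals epsilon, so epsilon plays the role that the learning rate plays in a plain natural gradient update.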
Why is the natural gradient invariant to reparametrization while the ordinary gradient is not?
Exponential Families as Statistical Manifolds
Exponential families (Gaussian, Bernoulli, Poisson, Dirichlet, and others) are the canonical class of statistical manifolds. Their geometry is special: they are flat in two dual senses, e-flat in the natural parameters theta and m-flat in the moment parameters eta = E[T(x)], which is Amari's dually flat structure.
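The two coordinate systems are linked by a Legendre transform: eta = nabla A(theta) and theta = nabla A*(eta), where A* is the convex conjugate of the log-partition function A. This is Amari's e/m duality. It underlies the geometric view of the EM algorithm (Amari's em algorithm): the E-step is an e-projection onto the data manifold, the M-step is an m-projection onto the model manifold.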
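The Bernoulli family makes this duality easy to check by hand. With T(x) = x and log-partition function A(theta) = log(1 + exp(theta)), the moment parameter is eta = A'(theta) = sigmoid(theta), the inverse map is the logit, and the Fisher information is A''(theta) = eta(1 - eta). The sketch below verifies these relations with finite differences; the choice of family, the test point, and the step h are illustrative assumptions.

```python
import numpy as np

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for the Bernoulli family."""
    return np.logaddexp(0.0, theta)

def mean_param(theta):
    """eta = dA/dtheta = E[T(x)], the sigmoid of theta."""
    return 1.0 / (1.0 + np.exp(-theta))

def natural_param(eta):
    """theta = dA*/deta, the logit of eta (inverse of mean_param)."""
    return np.log(eta / (1.0 - eta))

theta = 0.8
h = 1e-4

# Central finite differences of the log-partition function.
eta_fd = (log_partition(theta + h) - log_partition(theta - h)) / (2.0 * h)
fisher_fd = (log_partition(theta + h) - 2.0 * log_partition(theta)
             + log_partition(theta - h)) / h**2

eta = mean_param(theta)
fisher_exact = eta * (1.0 - eta)   # Var[T(x)] for a Bernoulli variable

print("eta from A'(theta):", eta_fd, "sigmoid(theta):", eta)
print("Fisher from A''(theta):", fisher_fd, "eta * (1 - eta):", fisher_exact)
print("round trip theta -> eta -> theta:", natural_param(eta))
```

The first derivative of A recovers the moment parameter and the second derivative recovers the Fisher information, which is the one-dimensional case of I(theta) = nabla^2 A(theta).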
In an exponential family p(x;theta)=h(x)exp(theta*T(x)-A(theta)), what is the Fisher matrix I(theta)?
Summary
- Statistical manifold: M = {p(x;theta) | theta in Theta} with metric g_{ij}(theta) = I_{ij}(theta)
- Score function: s(theta) = d log p/d theta; E[s]=0; I(theta)=E[ss^T]=-E[d^2 log p/d theta^2]
- Natural gradient: theta <- theta - alpha * I(theta)^{-1} grad L; invariant to reparametrization
- Exponential family: p(x;theta)=h(x)exp(theta*T(x)-A(theta)); I(theta)=nabla^2 A(theta)
Questions for Reflection
- Why does the Fisher matrix provide a valid Riemannian metric on the space of distributions?
- In what sense is the natural gradient invariant to reparametrization while the ordinary gradient is not?
- How do geodesics on the statistical manifold relate to optimal parameter updates?
Related Lessons
- ig-07-natural-gradient — ig-18 provides the geometric foundation for natural gradient
- ig-03-exp-family — Exponential families are the canonical example of statistical manifolds