Information Geometry

Natural Gradient in Neural Networks

SGD moves through the Euclidean parameter space. Parameters define probability distributions - a curved space. Amari natural gradient moves along that curvature and converges fundamentally faster.

K-FAC (DeepMind): natural gradient approximation for AlphaGo Zero - convergence 10x faster than SGD; also used in AlphaFold 2 for protein structure prediction
TRPO/PPO in RL: KL-divergence constraint between policies = bounded step in Fisher metric; guarantees monotone improvement without regression
Adam optimizer: division by sqrt(v_t) approximates the diagonal of the Fisher matrix - a diagonal natural gradient approximation
Bayesian optimization: geometry of distribution space for efficient next-point selection in AutoML pipelines

Предварительные знания

Fisher information matrix
Riemannian geometry
SGD and Adam

Amari Natural Gradient

The Fisher information matrix F(theta) defines a Riemannian metric on the manifold of probability distributions. For the family p(x|theta) the entry F_{ij}(theta) = E[d_i log p * d_j log p]. Geometrically: F(theta) approximates the KL-divergence in a small neighborhood of theta as a quadratic form. The natural gradient is the gradient in this Riemannian metric rather than in the Euclidean one, giving the direction of steepest descent in the space of distributions.

Reparameterization invariance is the key property of the natural gradient: F(theta)^{-1} nabla L does not depend on the particular coordinate system used for the parameters, only on the underlying distribution manifold. Rewrite the network weights in any invertible way - the effective step in distribution space remains the same.

DeepMind applies K-FAC for training AlphaGo Zero: convergence 10x faster than SGD. Google uses K-FAC in AlphaFold 2. TRPO in reinforcement learning uses the natural gradient with a KL constraint, guaranteeing monotone policy improvement without regression.

What does the Amari natural gradient do to the ordinary loss gradient?

Natural gradient: nabla-tilde L = F(theta)^{-1} nabla L. The Fisher matrix F defines a Riemannian metric capturing the curvature of the space of probability distributions. Newton's method uses the loss Hessian; the natural gradient uses the geometrically motivated Fisher matrix, which is always positive semidefinite.

K-FAC and TRPO: Natural Gradient in Practice

K-FAC (Kronecker-Factored Approximate Curvature) approximates the block-diagonal Fisher matrix layer by layer. For layer l the Fisher block F_l is approximated as a Kronecker product A_l (x) G_l, where A_l = E[a_{l-1} a_{l-1}^T] is the covariance of input activations and G_l = E[g_l g_l^T] is the covariance of pre-activation gradients. This allows efficient inversion without explicitly forming the d x d matrix.

PPO approximates TRPO by clipping the probability ratio r_t(theta) = pi_theta(a|s)/pi_theta_old(a|s) to [1-eps, 1+eps] instead of enforcing the KL constraint explicitly. This loses the exact Fisher geometry interpretation but is much cheaper and empirically comparable in performance.

Which Kronecker product property makes K-FAC computationally feasible?

The key K-FAC identity: (A_l x G_l)^{-1} = A_l^{-1} x G_l^{-1}. This reduces inversion of a (d_in * d_out)^2 matrix to inverting two matrices of size d_in^2 and d_out^2, dropping cost from O((d_in d_out)^3) to O(d_in^3 + d_out^3). AlphaFold 2 and AlphaGo Zero rely on this identity.

Fisher-Rao Metric and Reparameterization Invariance

The Fisher-Rao metric is the unique (up to scaling) Riemannian metric on the space of probability distributions that is invariant under sufficient statistics. This uniqueness theorem by Cencov (1975) gives the Fisher metric a canonical status: it is the only metric that does not change under application of a sufficient statistic to the data, i.e., when no information is discarded.

Jeffreys prior p(theta) proportional to sqrt(det F(theta)) is the canonical non-informative prior in Bayesian statistics. It is the unique prior that is invariant under reparameterization of theta - a direct consequence of the Fisher matrix being the natural volume form on the statistical manifold. For a Bernoulli model it gives the Beta(1/2, 1/2) prior (arcsine distribution).

What is special about the Fisher-Rao metric compared to other Riemannian metrics on the space of distributions?

Cencov's theorem (1975): the Fisher information matrix is the unique Riemannian metric on statistical models invariant under Markov morphisms. This uniqueness gives the Fisher metric its canonical status - it is the natural geometry for probability distributions, not an arbitrary choice.

Connections to other topics

The natural gradient unifies neural-network optimization, differential geometry, and information theory.

Newton's method — Related topic
TRPO and PPO — Related topic
EM algorithm — Related topic

Итоги

F(theta) = E[nabla log p * nabla log p^T]: Riemannian metric approximating 2 * KL-divergence near theta
Natural gradient: theta_{t+1} = theta_t - eta * F^{-1} nabla L - steepest descent in KL geometry
Reparameterization invariance: the natural gradient depends only on the distribution manifold, not on the parameter encoding
K-FAC: F_l approx A_l (kron) G_l; inversion cost O(d_in^3 + d_out^3) instead of O((d_in d_out)^3)
TRPO uses a KL constraint equivalent to a bounded Fisher metric step, guaranteeing monotone policy improvement