Information Geometry

Natural gradient: a covariant derivative for distributions

In 1998 Amari posed a strange question: what does it mean to step by $\epsilon$ in distribution space? If parameterizations $\sigma$, $\log\sigma$, and $\sigma^2$ are all valid, three different SGDs trace three different trajectories. Natural gradient kills that arbitrariness: one geometry, one direction, invariant under reparameterization. The same gap separates Euclidean and Riemannian geometry - and for neural networks it is real, not academic.

**TRPO and PPO in RL.** OpenAI Five (Dota 2) and DeepMind AlphaStar trained on natural policy gradient inside a trust region. Without a KL constraint, policy collapse happens within tens of iterations.
**K-FAC at DeepMind.** AlphaGo Zero's policy network was trained with K-FAC. Sophia (Liu 2023) sped up GPT-2-scale pretraining 2x relative to Adam at the same FLOPs.
**RLHF in Claude/ChatGPT.** PPO clipping is a surrogate for the natural-gradient KL constraint: clipping the ratio $\frac{\pi_\theta}{\pi_{\text{old}}}$ is a coarse approximation of TRPO's trust region.

Предварительные знания

Definition: a covariant derivative for distributions

Vanilla gradient depends on parameterization

The ordinary gradient $\nabla L(\theta)$ is just the vector of partial derivatives in some coordinate system $\theta$. The problem: a change of parameterization changes the gradient. Pick $\theta=(\mu, \sigma)$ for a Gaussian or $\theta=(\mu, \log\sigma)$ - SGD takes different steps, follows different trajectories, converges at different speeds. Geometrically meaningless: the distribution is the same.

The natural gradient $\tilde{\nabla} L$ is the vanilla gradient pre-multiplied by the inverse of the Fisher information matrix $F(\theta)$. The key property: under reparameterization $F$ transforms as a metric tensor and $\nabla L$ as a covector, so the product is invariant. The step direction in distribution space is the same regardless of which coordinates do the bookkeeping.

**Fisher information matrix:** $F_{ij}(\theta) = \mathbb{E}_{x\sim p_\theta}\big[\partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x)\big]$. It is the metric on the manifold of distributions, also called the Fisher-Rao metric.

Geometrically: natural gradient is steepest descent in KL divergence, not in Euclidean parameter norm. The step $\theta \leftarrow \theta - \eta \tilde{\nabla} L$ minimises loss among all steps with a fixed KL radius. Vanilla SGD minimises among steps with a fixed Euclidean radius - which is metrically arbitrary.

Where natural gradient came from

Amari (1998) formalised the natural gradient as a covariant derivative on a statistical manifold. The idea had been brewing for decades: Akaike (1981) used Fisher information for model selection (AIC), Kullback (1959) defined KL divergence as an information distance. Amari fused it into one framework: information geometry. The breakthrough moment came when Peters & Schaal (2008) showed on robotics that natural policy gradient converged 5-10x faster than vanilla policy gradient.

What separates the natural gradient $\tilde{\nabla} L = F^{-1} \nabla L$ from the vanilla gradient $\nabla L$?

Riemannian geometry and the Fisher inverse

The manifold of distributions and the Fisher metric

A parametric family $\{p(\cdot|\theta) : \theta \in \Theta\}$ is a multidimensional manifold: each point $\theta$ corresponds to a distribution. The Fisher metric $F(\theta)$ defines an inner product on the tangent space: the distance between nearby distributions is measured by KL, and to leading order KL is quadratic in $F$.

This identity is the heart of natural gradient. Minimising $L(\theta + \epsilon) - L(\theta) \approx \nabla L^T \epsilon$ subject to $D_{KL} \le \delta$ gives the direction $\epsilon^* \propto -F^{-1} \nabla L$. So natural gradient solves a trust-region problem in KL geometry.

**Geodesics and parallel transport.** SGD in the Euclidean metric follows straight lines in $\theta$-space - but a straight line in $\theta$ is not a straight line on the manifold of distributions. Natural gradient follows Fisher-Rao geodesics: shortest paths in KL distance. Christoffel symbols describe how tangent vectors are transported along curves; for an exponential family they have a closed form.

**Connection to TRPO in RL.** Trust Region Policy Optimization (Schulman 2015) solves $\max_\theta \mathbb{E}[A(s,a)]$ subject to $\mathbb{E}[D_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)] \le \delta$. Linearising the objective and quadratising the constraint produces the natural gradient. PPO is a cheap heuristic replacement of that constraint via clipping.

What does the Fisher metric $F(\theta)$ measure locally?

K-FAC and Shampoo: practical approximations

Why direct $F^{-1}$ is impossible

For a model with $d$ parameters the Fisher matrix has size $d \times d$. Storage costs $O(d^2)$, inversion costs $O(d^3)$. For GPT-2 small ($d \approx 10^8$) that is $10^{16}$ FLOPs per inversion every step - unfeasible. Every production-grade natural-gradient method approximates $F^{-1}$, trading accuracy against compute.

Diagonal Fisher — Keep only the diagonal of $F$. Cost: $O(d)$. Used in AdaGrad, RMSProp, and Adam (as a heuristic). Accuracy is poor - it ignores parameter correlations - but it is cheap.
Empirical Fisher — Replace $\mathbb{E}_{p_\theta}$ with a batch average: $F \approx \tfrac{1}{B}\sum \nabla \log p \cdot \nabla \log p^T$. Cheaper than the true Fisher, but biases the estimate when model and data mismatch.
K-FAC (Kronecker-Factored) — Per network layer, $F \approx A \otimes G$, where $A$ is the activation covariance and $G$ is the gradient covariance. Then $F^{-1} \approx A^{-1} \otimes G^{-1}$. Cost: $O(\text{layer\_dim}^3)$ instead of $O(d^3)$. Martens & Grosse 2015.
Shampoo / Sophia — Block-diagonal second-order preconditioners. Shampoo (Gupta 2018) uses Kronecker-factored statistics similar to K-FAC but as a general-purpose optimiser. Sophia (Liu 2023) is a diagonal Hessian estimator for LLM training.
Conjugate gradient — Solve $F x = g$ iteratively via matrix-vector products $F v$. Each $F v$ takes one extra backward pass without materialising $F$. Hessian-Free optimisation (Martens 2010).

**Answer to 'which approximation does Adam use?'.** Adam keeps a diagonal empirical Fisher (through $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$). It is natural gradient at its crudest. That is why Adam is often called a 'cheap natural gradient' - empirically close, but ignoring off-diagonal Fisher.

**Where K-FAC has actually shipped.** DeepMind used K-FAC for policy training in AlphaGo and AlphaZero. Sophia (Liu et al. 2023) sped up GPT-2-scale pretraining 2x compared to Adam at the same FLOPs. Shampoo was adopted by Google for PaLM. The trade-off: compiler and memory complexity grow, but wall-clock convergence improves at large batch sizes.

Natural gradient is always better than Adam, just too expensive.

Natural gradient and Adam solve different problems. Natural gradient is a principled second-order method; Adam is a diagonal-Fisher heuristic. When compute allows (RL with small networks, fine-tuning) and gradients are strongly correlated, natural gradient wins by 5-10x. For huge DNNs with billions of parameters and stochastic batches, Adam is often enough because off-diagonal Fisher is noisy and expensive to estimate.

It is not 'better/worse' but a trade-off between principled geometry and compute. K-FAC and Shampoo sit in the middle - block-diagonal approximations that bring natural gradient to large models.

Key ideas

Natural gradient $\tilde{\nabla}L = F^{-1}\nabla L$ is invariant under reparameterization - the step direction is identical in any coordinates
Geometrically it is steepest descent in KL divergence: $D_{KL}(p_\theta \| p_{\theta+\epsilon}) \approx \tfrac12 \epsilon^T F \epsilon$, and natural gradient solves the trust-region problem
Direct $F^{-1}$ costs $O(d^3)$ - neural nets need approximations: diagonal Fisher (Adam), K-FAC, Shampoo, Sophia, conjugate gradient
Natural gradient is the principled foundation of TRPO, PPO, natural policy gradient in RL and K-FAC/Shampoo in deep learning

Connections to other topics

Natural gradient is the central hub of information geometry: it links the Fisher metric to practical optimisation.

Fisher information metric — The metric in which natural gradient is steepest descent. Without Fisher, the formula $F^{-1}\nabla L$ has no meaning.
RL and bandits — TRPO, PPO, natural Q-learning - all are built on natural gradient. The trust-region KL constraint is exactly the Fisher metric.
Neural network training — K-FAC, Shampoo, Sophia are production-grade approximations of natural gradient for deep nets. AlphaGo and LLM pretraining use them at scale.

Вопросы для размышления

If parameterization does not affect natural gradient, why is vanilla SGD still the default - and in which scenarios is the parameterization arbitrariness harmless in practice?
Adam = diagonal empirical Fisher: is this a principled approximation or a coincidence? What is preserved and what is lost relative to full Fisher?
TRPO solves a trust-region problem explicitly; PPO replaces it with clipping. What geometric intuition underlies the clip ratio - and where does the approximation break?

Связанные уроки

ig-02-fisher-metric — Natural gradient is the gradient measured in the Fisher metric; without it the formula $F^{-1}\nabla L$ is meaningless
ig-04-kl-bregman — Local KL is quadratic in Fisher: $D_{KL}(p_\theta \| p_{\theta+\epsilon}) \approx \tfrac12 \epsilon^T F \epsilon$ - this is the justification for natural gradient
ig-05-dual-flat — Dually flat structure explains why natural gradient is invariant under reparameterization
ig-12-rl-bandits — TRPO, PPO, and natural policy gradient are direct applications of natural gradient in RL
ig-14-neural — K-FAC, Shampoo, and Sophia are practical approximations of natural gradient for deep nets
stat-01-sampling