Information Geometry
Natural gradient: a covariant derivative for distributions
In 1998 Amari posed a strange question: what does it mean to step by $\epsilon$ in distribution space? If parameterizations $\sigma$, $\log\sigma$, and $\sigma^2$ are all valid, three different SGDs trace three different trajectories. Natural gradient kills that arbitrariness: one geometry, one direction, invariant under reparameterization. The same gap separates Euclidean and Riemannian geometry - and for neural networks it is real, not academic.
- **TRPO and PPO in RL.** OpenAI Five (Dota 2) and DeepMind AlphaStar trained on natural policy gradient inside a trust region. Without a KL constraint, policy collapse happens within tens of iterations.
- **K-FAC at DeepMind.** AlphaGo Zero's policy network was trained with K-FAC. Sophia (Liu 2023) sped up GPT-2-scale pretraining 2x relative to Adam at the same FLOPs.
- **RLHF in Claude/ChatGPT.** PPO clipping is a surrogate for the natural-gradient KL constraint: clipping the ratio $\frac{\pi_\theta}{\pi_{\text{old}}}$ is a coarse approximation of TRPO's trust region.
Предварительные знания
Definition: a covariant derivative for distributions
Vanilla gradient depends on parameterization
The ordinary gradient $\nabla L(\theta)$ is just the vector of partial derivatives in some coordinate system $\theta$. The problem: a change of parameterization changes the gradient. Pick $\theta=(\mu, \sigma)$ for a Gaussian or $\theta=(\mu, \log\sigma)$ - SGD takes different steps, follows different trajectories, converges at different speeds. Geometrically meaningless: the distribution is the same.
The natural gradient $\tilde{\nabla} L$ is the vanilla gradient pre-multiplied by the inverse of the Fisher information matrix $F(\theta)$. The key property: under reparameterization $F$ transforms as a metric tensor and $\nabla L$ as a covector, so the product is invariant. The step direction in distribution space is the same regardless of which coordinates do the bookkeeping.
**Fisher information matrix:** $F_{ij}(\theta) = \mathbb{E}_{x\sim p_\theta}\big[\partial_i \log p_\theta(x)\, \partial_j \log p_\theta(x)\big]$. It is the metric on the manifold of distributions, also called the Fisher-Rao metric.
Geometrically: natural gradient is steepest descent in KL divergence, not in Euclidean parameter norm. The step $\theta \leftarrow \theta - \eta \tilde{\nabla} L$ minimises loss among all steps with a fixed KL radius. Vanilla SGD minimises among steps with a fixed Euclidean radius - which is metrically arbitrary.
Where natural gradient came from
Amari (1998) formalised the natural gradient as a covariant derivative on a statistical manifold. The idea had been brewing for decades: Akaike (1981) used Fisher information for model selection (AIC), Kullback (1959) defined KL divergence as an information distance. Amari fused it into one framework: information geometry. The breakthrough moment came when Peters & Schaal (2008) showed on robotics that natural policy gradient converged 5-10x faster than vanilla policy gradient.
What separates the natural gradient $\tilde{\nabla} L = F^{-1} \nabla L$ from the vanilla gradient $\nabla L$?
Riemannian geometry and the Fisher inverse
The manifold of distributions and the Fisher metric
A parametric family $\{p(\cdot|\theta) : \theta \in \Theta\}$ is a multidimensional manifold: each point $\theta$ corresponds to a distribution. The Fisher metric $F(\theta)$ defines an inner product on the tangent space: the distance between nearby distributions is measured by KL, and to leading order KL is quadratic in $F$.
This identity is the heart of natural gradient. Minimising $L(\theta + \epsilon) - L(\theta) \approx \nabla L^T \epsilon$ subject to $D_{KL} \le \delta$ gives the direction $\epsilon^* \propto -F^{-1} \nabla L$. So natural gradient solves a trust-region problem in KL geometry.
**Geodesics and parallel transport.** SGD in the Euclidean metric follows straight lines in $\theta$-space - but a straight line in $\theta$ is not a straight line on the manifold of distributions. Natural gradient follows Fisher-Rao geodesics: shortest paths in KL distance. Christoffel symbols describe how tangent vectors are transported along curves; for an exponential family they have a closed form.
**Connection to TRPO in RL.** Trust Region Policy Optimization (Schulman 2015) solves $\max_\theta \mathbb{E}[A(s,a)]$ subject to $\mathbb{E}[D_{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)] \le \delta$. Linearising the objective and quadratising the constraint produces the natural gradient. PPO is a cheap heuristic replacement of that constraint via clipping.
What does the Fisher metric $F(\theta)$ measure locally?
K-FAC and Shampoo: practical approximations
Why direct $F^{-1}$ is impossible
For a model with $d$ parameters the Fisher matrix has size $d \times d$. Storage costs $O(d^2)$, inversion costs $O(d^3)$. For GPT-2 small ($d \approx 10^8$) that is $10^{16}$ FLOPs per inversion every step - unfeasible. Every production-grade natural-gradient method approximates $F^{-1}$, trading accuracy against compute.
- Diagonal Fisher — Keep only the diagonal of $F$. Cost: $O(d)$. Used in AdaGrad, RMSProp, and Adam (as a heuristic). Accuracy is poor - it ignores parameter correlations - but it is cheap.
- Empirical Fisher — Replace $\mathbb{E}_{p_\theta}$ with a batch average: $F \approx \tfrac{1}{B}\sum \nabla \log p \cdot \nabla \log p^T$. Cheaper than the true Fisher, but biases the estimate when model and data mismatch.
- K-FAC (Kronecker-Factored) — Per network layer, $F \approx A \otimes G$, where $A$ is the activation covariance and $G$ is the gradient covariance. Then $F^{-1} \approx A^{-1} \otimes G^{-1}$. Cost: $O(\text{layer\_dim}^3)$ instead of $O(d^3)$. Martens & Grosse 2015.
- Shampoo / Sophia — Block-diagonal second-order preconditioners. Shampoo (Gupta 2018) uses Kronecker-factored statistics similar to K-FAC but as a general-purpose optimiser. Sophia (Liu 2023) is a diagonal Hessian estimator for LLM training.
- Conjugate gradient — Solve $F x = g$ iteratively via matrix-vector products $F v$. Each $F v$ takes one extra backward pass without materialising $F$. Hessian-Free optimisation (Martens 2010).
**Answer to 'which approximation does Adam use?'.** Adam keeps a diagonal empirical Fisher (through $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$). It is natural gradient at its crudest. That is why Adam is often called a 'cheap natural gradient' - empirically close, but ignoring off-diagonal Fisher.
**Where K-FAC has actually shipped.** DeepMind used K-FAC for policy training in AlphaGo and AlphaZero. Sophia (Liu et al. 2023) sped up GPT-2-scale pretraining 2x compared to Adam at the same FLOPs. Shampoo was adopted by Google for PaLM. The trade-off: compiler and memory complexity grow, but wall-clock convergence improves at large batch sizes.
Natural gradient is always better than Adam, just too expensive.
Natural gradient and Adam solve different problems. Natural gradient is a principled second-order method; Adam is a diagonal-Fisher heuristic. When compute allows (RL with small networks, fine-tuning) and gradients are strongly correlated, natural gradient wins by 5-10x. For huge DNNs with billions of parameters and stochastic batches, Adam is often enough because off-diagonal Fisher is noisy and expensive to estimate.
It is not 'better/worse' but a trade-off between principled geometry and compute. K-FAC and Shampoo sit in the middle - block-diagonal approximations that bring natural gradient to large models.
Which Fisher-matrix approximation does Adam use?
Key ideas
- Natural gradient $\tilde{\nabla}L = F^{-1}\nabla L$ is invariant under reparameterization - the step direction is identical in any coordinates
- Geometrically it is steepest descent in KL divergence: $D_{KL}(p_\theta \| p_{\theta+\epsilon}) \approx \tfrac12 \epsilon^T F \epsilon$, and natural gradient solves the trust-region problem
- Direct $F^{-1}$ costs $O(d^3)$ - neural nets need approximations: diagonal Fisher (Adam), K-FAC, Shampoo, Sophia, conjugate gradient
- Natural gradient is the principled foundation of TRPO, PPO, natural policy gradient in RL and K-FAC/Shampoo in deep learning
Connections to other topics
Natural gradient is the central hub of information geometry: it links the Fisher metric to practical optimisation.
- Fisher information metric — The metric in which natural gradient is steepest descent. Without Fisher, the formula $F^{-1}\nabla L$ has no meaning.
- RL and bandits — TRPO, PPO, natural Q-learning - all are built on natural gradient. The trust-region KL constraint is exactly the Fisher metric.
- Neural network training — K-FAC, Shampoo, Sophia are production-grade approximations of natural gradient for deep nets. AlphaGo and LLM pretraining use them at scale.
Вопросы для размышления
- If parameterization does not affect natural gradient, why is vanilla SGD still the default - and in which scenarios is the parameterization arbitrariness harmless in practice?
- Adam = diagonal empirical Fisher: is this a principled approximation or a coincidence? What is preserved and what is lost relative to full Fisher?
- TRPO solves a trust-region problem explicitly; PPO replaces it with clipping. What geometric intuition underlies the clip ratio - and where does the approximation break?
Связанные уроки
- ig-02-fisher-metric — Natural gradient is the gradient measured in the Fisher metric; without it the formula $F^{-1}\nabla L$ is meaningless
- ig-04-kl-bregman — Local KL is quadratic in Fisher: $D_{KL}(p_\theta \| p_{\theta+\epsilon}) \approx \tfrac12 \epsilon^T F \epsilon$ - this is the justification for natural gradient
- ig-05-dual-flat — Dually flat structure explains why natural gradient is invariant under reparameterization
- ig-12-rl-bandits — TRPO, PPO, and natural policy gradient are direct applications of natural gradient in RL
- ig-14-neural — K-FAC, Shampoo, and Sophia are practical approximations of natural gradient for deep nets
- stat-01-sampling