Information Geometry

Natural Gradient in Deep Learning

Adam is the most popular optimizer. But Adam is a diagonal approximation of natural gradient that ignores cross-parameter correlations. K-FAC uses a Kronecker-factored Fisher matrix and accelerates convergence by 2-4x. Distributed Shampoo scales to 100B+ parameters and was used in Gemini training. These are not academic exercises - they are millions of dollars in compute savings.

K-FAC (Martens & Grosse 2015): training ResNet-50 on ImageNet in 2.5x fewer steps than Adam
Distributed Shampoo (Google 2023): used in training AlphaFold 2 and partially Gemini
Sophia optimizer (Stanford 2023): diagonal Hessian approximation, 2x faster LLM training at GPT-2 scale

K-FAC: Kronecker Factorization of the Fisher Matrix

For a neural network with L layers the Fisher matrix is approximately block-diagonal across layers. In each layer l: input activation a_{l-1} in R^n, output gradient g_l in R^m, weight W in R^{m x n}. The Fisher block for layer l: F_l = E[g_l g_l^T] tensor E[a_{l-1} a_{l-1}^T] = G_l tensor A_l.

The damping parameter in K-FAC plays the role of a learning-rate scale. Too small a damping causes instability (the Fisher matrix is singular). Too large and you slide back toward plain SGD. Adaptive damping (Levenberg-Marquardt rule) automatically tunes it from the change in loss.

What is the computational advantage of K-FAC over the exact natural gradient?

Shampoo and Distributed Shampoo: Scaling Up

Shampoo (Gupta et al. 2018): for a weight matrix W in R^{m x n} it maintains two accumulated factors L = sum G_t G_t^T and R = sum G_t^T G_t. Update: W = W - lr * L^{-1/4} G R^{-1/4}. This is a full-matrix preconditioner for each weight matrix.

Distributed Shampoo (2023): the Kronecker factors L and R for a large layer (say an embedding with dim = 32000) require O(32000^2) = 1 GB of memory each. The fix: shard rows/columns of L and R across GPUs via ZeRO-style partitioning. Each GPU stores its slice and computes a partial root. This lets Shampoo apply to transformer models with 100B parameters.

Shampoo updates weights as W -= lr * L^{-1/4} * G * R^{-1/4}. Why the 1/4 power rather than 1/2?

Riemannian SGD and Optimization on Manifolds

Some neural-network parameters do not live in Euclidean space. Stiefel-manifold parameters (orthogonal matrices): Q^T Q = I. Hyperbolic embeddings: norm strictly less than 1. Probability simplex: sum p_i = 1. For such parameters natural gradient descent requires a retraction back onto the manifold.

Why can ordinary Adam not be used to optimize hyperbolic embeddings?

Adam as an Approximation of Natural Gradient

Adam: theta = theta - lr * m_t / (sqrt(v_t) + eps). Here v_t = E[g^2] is a diagonal estimate of the second moment of the gradient. The diagonal of the Fisher matrix: F_{ii} = E[(d log p / d theta_i)^2] approximately E[g_i^2] = v_t. So Adam is a diagonal approximation of the natural gradient.

What Adam misses: (1) cross-correlations between parameters (K-FAC captures them). (2) The structure of the Fisher matrix (not just the second moment of the gradient). (3) The geometry of the parameter manifold for structured parameters. That is why K-FAC and Shampoo beat Adam on tasks with strong inter-parameter correlations.

Despite the theoretical superiority of K-FAC, Adam often wins in practice because of (1) ease of tuning (one lr versus lr + damping + update_freq), (2) lower memory cost, (3) good behaviour with adaptivity across diverse tasks. K-FAC pays off only on very long training runs where step savings outweigh compute cost.

The diagonal Fisher F_{ii} = E[(d log p/d theta_i)^2]. How does this relate to v_t in Adam?

Итоги

K-FAC: $F \approx A \otimes G$ per layer; natural gradient $= G^{-1} W_{\text{grad}} A^{-1}$; $O(\sqrt{n}^2)$ instead of $O(n^2)$
Shampoo: full-matrix preconditioner $W \to L^{-1/4} G R^{-1/4}$; Distributed Shampoo shards the factors across GPUs
Riemannian SGD: update $\theta \leftarrow \exp_{\theta}(-\eta \cdot F^{-1}\nabla L)$ with retraction onto the manifold
Adam as approximation: $g/(\sqrt{v}+\epsilon)$ is diagonal rescaling without cross-correlations; K-FAC captures block structure

Вопросы для размышления

K-FAC requires O(n^2) memory per block. At what layer size does this become impractical, and how does Distributed Shampoo address this through sharding?
Adam often outperforms K-FAC in practice despite the theoretical superiority of natural gradient. What properties of real-world tasks explain this paradox?
Sophia uses a diagonal Hessian instead of the Fisher matrix. In what sense is this better than Adam? What is lost compared to K-FAC?

Связанные уроки

ig-07-natural-gradient — ig-17 covers production-grade natural-gradient implementations
ig-14-neural — K-FAC and Shampoo originate in ig-14
ig-16-alpha-divergences — Alpha-geometry explains the choice of metric