Information Theory
Information Geometry and the Fisher Matrix
Why does Adam work better than SGD? Why does the EM algorithm converge monotonically? Why does TRPO train policies without catastrophic forgetting? Behind all these questions lies one idea: the space of probability distributions is not flat-it is a curved Riemannian manifold.
- **Neural network optimization**: K-FAC uses the Fisher matrix to accelerate training by approximating the natural gradient
- **Reinforcement Learning**: TRPO and PPO use the Fisher metric for safe policy updates via KL constraints
- **EM and Bayesian ML**: the geometric interpretation of EM explains its monotonicity and enables new algorithmic generalizations
Предварительные знания
Statistical Manifold
All Gaussian distributions N(μ, σ²) form a 2D manifold - but it is curved. The Fisher-Rao distance between N(0,1) and N(1,1) differs from the distance between N(0,100) and N(1,100) by a factor of 100. K-FAC, used by DeepMind for AlphaGo Zero, exploits this curvature to compute natural gradients 10x faster than naive Fisher inverse.
The key question: how to **properly** measure the closeness of two distributions in parameter space? Euclidean distance ||(μ₁,σ₁) - (μ₂,σ₂)|| is poor: as σ → ∞, two distributions with different μ become indistinguishable, yet their Euclidean distance in μ stays constant.
Information geometry was created by Shun-ichi Amari (works from the 1980s, book 1985). He showed that the space of probability distributions has a **dual structure**: two complementary curvatures (e-flat and m-flat geometries). This led to the concept of the natural gradient and its applications in neural networks.
Why does Euclidean distance in parameter space poorly reflect the 'closeness' of distributions?
Fisher Matrix as a Riemannian Metric
**Fisher information** measures how sensitive a distribution is to changes in the parameter θ. If a small change in θ dramatically changes the distribution, the Fisher information is large. This is exactly the correct local metric on the statistical manifold.
The connection to KL-divergence: the Fisher matrix is the **second derivative** of KL(p(θ) || p(θ+dθ)) with respect to dθ at zero. This makes it the **canonical** Riemannian metric on the space of probability distributions-invariant to reparametrization.
The Fisher matrix sets the fundamental bound on estimation accuracy-the **Cramér-Rao inequality**: the variance of any unbiased estimator θ̂ cannot be smaller than 1/F(θ). This is a lower bound for all statistical methods. Maximally efficient estimators (MLE as n → ∞) achieve this bound.
Fisher information F(p) for Bernoulli(p) is maximum at p = 0.5 or p = 0.1?
Natural Gradient in Machine Learning
Ordinary gradient descent works in **Euclidean** parameter space: θ ← θ - η ∇L(θ). But neural network parameters define a distribution, and the parameter space is a Riemannian manifold with the Fisher metric. The **natural gradient** corrects this.
Trust Region Policy Optimization (TRPO, Schulman 2015) uses the Fisher metric for safe policy updates in reinforcement learning. The step constraint: KL(π_old || π_new) ≤ δ is exactly a distance constraint in the Riemannian geometry. PPO approximates TRPO via clipping while preserving the idea. Natural gradient is what makes RL training stable.
What is the main advantage of the natural gradient over the ordinary gradient?
Dual Geometry: e-Flatness and m-Flatness
Amari's most surprising discovery: a statistical manifold has a **dual affine structure**. There exist two mutually dual flat spaces-the e-flat (exponential families) and m-flat (mixtures). Projections onto these spaces correspond to maximum likelihood estimation and the EM algorithm respectively.
| Concept | e-geometry | m-geometry |
|---|---|---|
| Coordinates | Natural θ | Moment η = E[T(x)] |
| Geodesics | Mixtures in natural parameters | Mixtures in moments |
| Projection | ML estimation (MLE) | Moment matching |
| Algorithm | E-step of EM | M-step of EM |
| Linking map | ∂A/∂θ = η | Forward and inverse |
The **EM algorithm** is alternating projections in dual geometry! The E-step is a projection onto the m-flat submanifold (updating expectations), the M-step is projection back onto the e-flat submanifold (likelihood maximization). The convergence guarantee of EM follows from this geometry.
Choosing a divergence in ML is choosing a geometry of optimization. KL(p||q) is used in VAEs (Evidence Lower Bound), KL(q||p) is used in variational Bayes with mode-seeking, Amari's α-divergences generalize both. Wasserstein distance (GAN) is another geometry, particularly good for sparse distributions.
The EM algorithm can be interpreted through information geometry as:
Key Ideas
- **Statistical manifold**: the space of parametric distributions with a Riemannian structure
- **Fisher matrix** F(θ)ᵢⱼ = E[(∂ log p/∂θᵢ)(∂ log p/∂θⱼ)] - canonical metric (≈ curvature of KL)
- **Natural gradient** = F⁻¹∇L - invariant to reparametrization, the basis of K-FAC, TRPO
- **Duality**: e-geometry (exponential families) + m-geometry (mixtures) explains the EM algorithm
Related Topics
Information geometry connects information theory to differential geometry and ML:
- KL-Divergence and Mutual Information — Fisher matrix is the second derivative of KL-geometric basis of divergences
- IT in Statistics: Cramér-Rao — The Cramér-Rao inequality is a direct consequence of the Fisher metric
- Differential Entropy — h(X) is the scalar analogue; Fisher matrix is its tensor generalization
Вопросы для размышления
- The Adam optimizer uses a diagonal approximation of the second derivative of the loss. In what sense is this related to the Fisher matrix? What does Adam miss?
- Dual geometry explains EM. How could this same idea be used to develop new optimization algorithms?
- KL(p||q) ≠ KL(q||p) - it is asymmetric. What is the geometric meaning of this asymmetry? When does the order of arguments matter?