Information Geometry
Dual flat structure: geometry of duality
The EM algorithm, natural gradient, mirror descent - all three trace back to Amari's dual geometry. E-step is an m-projection. M-step is an e-projection. EM convergence is the Pythagorean theorem for KL divergence. Not a metaphor. Literal $a^2 + b^2 = c^2$ - but for probability distributions. Shun-ichi Amari proved this in Tokyo in 1985. Four decades later, every major neural network optimizer exploits this geometry in one form or another.
- **Natural gradient / K-FAC**: the natural gradient moves along e-geodesics on the manifold of network parameters. This is why K-FAC converges faster than Adam when the problem has a clear exp-family structure (VAE, normalizing flows)
- **EM algorithm (GMM, HMM, LDA)**: each EM iteration alternates an e-projection (M-step) and an m-projection (E-step). Monotone convergence is a direct corollary of Amari's Pythagorean theorem
- **Mirror descent and Sinkhorn**: mirror descent is a sequence of Bregman m-projections. Sinkhorn iterations converge because they alternate Bregman projections onto e- and m-flat constraints
Предварительные знания
- KL divergence and Bregman divergence
- Exponential family and natural parameters
- Fisher metric
e- and m-connections
In ordinary space there is one geodesic between two points - the straight line. On a statistical manifold there are two. Amari proved this in 1985.
The **e-connection** (exponential) defines parallel transport along the exponential family: $p_t = p^{1-t} q^t / Z_t$. Geodesics are exponential arcs in the space of distributions. The Riemann curvature tensor vanishes under this connection - the manifold is e-flat.
The **m-connection** (mixture) defines a different parallel transport: $p_t = (1-t)p + tq$. Geodesics are mixture paths. Curvature also vanishes - the manifold is m-flat.
**Key property**: the e-connection and m-connection are dual with respect to the Fisher metric. What looks straight under e-structure curves under m-structure, and vice versa. Two views of the same manifold - like Cartesian and polar coordinates, but for probability spaces.
The natural gradient in optimizers like K-FAC follows which geodesic?
Dual coordinates: theta and eta
The exponential family $p(x; \theta) = h(x) \exp(\langle \theta, T(x) \rangle - A(\theta))$ lives in two coordinate systems simultaneously.
**Natural parameters** $\theta$ are the e-coordinates - affinely flat under e-structure: e-geodesics appear straight. For a Gaussian: $\theta_1 = \mu/\sigma^2$, $\theta_2 = -1/(2\sigma^2)$.
**Expectation parameters** $\eta = \mathbb{E}_{\theta}[T(x)] = \nabla A(\theta)$ are the m-coordinates - affinely flat under m-structure: m-geodesics appear straight. For a Gaussian: $\eta_1 = \mu$, $\eta_2 = \mu^2 + \sigma^2$.
**Connection via log-partition**: $\eta = \nabla A(\theta)$ and $\theta = \nabla A^*(\eta)$, where $A^*$ is the Legendre conjugate of $A$. This is Legendre duality in geometric form. The transition $\theta \leftrightarrow \eta$ acts as an information-geometric Fourier transform.
In the E-step of the EM algorithm, Q(theta) = E_{p(z|x,theta_old)}[log p(x,z|theta)] is computed. What type of projection does this correspond to?
Pythagorean theorem for KL divergence
1985. Amari proves that under specific conditions, KL divergence behaves exactly like the square of a distance. Not a metaphor - a literal Pythagorean theorem.
Let $q$ be the m-projection of $p$ onto an e-flat submanifold $\mathcal{S}$, i.e. $q = \arg\min_{r \in \mathcal{S}} D_{KL}(r \| p)$. Then:
**Amari's Pythagorean theorem**: if $q$ is the m-projection of $p$ onto e-flat $\mathcal{S}$, then for any $r \in \mathcal{S}$: $$D_{KL}(p \| r) = D_{KL}(p \| q) + D_{KL}(q \| r)$$ No gap. A complete generalization of $|pr|^2 = |pq|^2 + |qr|^2$ to the space of probability distributions.
This is exactly why EM converges. Each iteration: E-step (m-projection) + M-step (e-projection). The Pythagorean theorem guarantees that $D_{KL}(p_{true} \| p_{model})$ decreases monotonically. Not a heuristic - a geometric fact.
The Pythagorean theorem for KL is only an approximation valid near a reference point
It is an exact equality whenever the e- and m-geodesics meet orthogonally
The asymptotic approximation $D_{KL} \approx \frac{1}{2} \Delta\theta^T F \Delta\theta$ holds only near a point. Amari's generalized Pythagorean theorem is an exact geometric statement about dually flat manifolds - no asymptotics required.
The EM algorithm's convergence is explained by the Pythagorean theorem because...
Key ideas
- **e-connection** defines geodesics via exponential arcs ($p^{1-t}q^t$), **m-connection** via mixtures ($(1-t)p + tq$). Both yield zero curvature - two flat spaces on one manifold
- **Dual coordinates**: $\theta$ (natural parameters) are e-flat, $\eta = \nabla A(\theta)$ (expectations) are m-flat. They are linked by Legendre duality: $\theta = \nabla A^*(\eta)$
- **Amari's Pythagorean theorem**: $D_{KL}(p \| r) = D_{KL}(p \| q) + D_{KL}(q \| r)$ when geodesics meet orthogonally. An exact equality, not an approximation
- **EM = alternating projections**: E-step is an m-projection, M-step is an e-projection. Monotone convergence is a geometric guarantee, not a heuristic
What comes next
Dual flat structure is the foundation for applied tools in information geometry:
- Natural gradient — Motion along e-geodesics on the parametric manifold
- Information projections — e- and m-projections as optimization in the space of distributions
- Mirror descent — Bregman projection as m-projection in convex optimization
- Optimal transport: dual formulation — Primal-dual structure - a related idea for the transport problem
Вопросы для размышления
- The EM algorithm can get stuck in local maxima. The Pythagorean theorem guarantees monotone convergence - where is the contradiction?
- Mirror descent and natural gradient both exploit the dual structure. How do their geodesic paths in the space of distributions fundamentally differ?
- If e-coordinates are flat for the e-structure and m-coordinates are flat for the m-structure, what happens if one tries to build a single coordinate system flat under both simultaneously?
Связанные уроки
- ig-07-natural-gradient — Natural gradient follows e-geodesics - direct consequence of dual structure
- ig-09-mirror-descent — Mirror descent is an m-projection (Bregman), a special case of dual geometry
- ig-04-kl-bregman — Duality is expressed via Bregman divergence / KL
- ig-08-info-projection — e- and m-projections are the central tool of information geometry
- prob-25-info-theory — Pythagorean theorem for KL connects to projection theorems in information theory
- stat-01-sampling