Information Geometry
Statistical Manifolds: e- and m-Geodesics
Variational inference in VAEs, the EM algorithm for Gaussian mixtures, and natural gradient in TRPO all rest on the same geometric idea: alternating projections onto dual flat manifolds. Amari showed that KL divergence is the Pythagorean theorem of this geometry.
- VAE (Kingma & Welling 2013): ELBO optimization is an e-projection of the approximating distribution onto the true posterior
- EM algorithm: the E-step is an m-projection, the M-step is an e-projection; monotonicity is guaranteed geometrically
- Mean-field variational inference in Pyro and NumPyro: each optimization step is a projection in e-flat geometry
The Exponential Family as an e-Flat Manifold
GPT-4 uses softmax, which is projection onto the simplex of probability distributions, an e-flat manifold. Adam converges faster than SGD precisely because it adaptively approximates the natural gradient on this manifold. Normal, Bernoulli, Poisson - all live in the same form: p(x;theta) = exp(theta^T T(x) - A(theta)) h(x).
Numerical example for Bernoulli(p): natural parameter theta = log(p/(1-p)) (log-odds). Moment eta = E[X] = p. Relation: p = sigma(theta) = 1/(1+e^{-theta}). For theta = 0: p = 0.5, eta = 0.5. For theta = 2: p = e^2/(1+e^2) approximately 0.88, eta = 0.88.
The e-geodesic between p(x; theta_0) and p(x; theta_1) is the distribution p(x; (1-t)*theta_0 + t*theta_1) for t in [0,1]. A straight mixture in the space of natural parameters. The exponential family is closed under such mixtures.
For the exponential family p(x;theta) = exp(theta*T(x) - A(theta)), what does the gradient dA/dtheta compute?
e- and m-Geodesics: Two Ways to Interpolate
On a statistical manifold there are two canonical types of geodesic. The e-geodesic (exponential): a mixture in natural parameters theta. The m-geodesic (mixture): a mixture in moments eta. They differ whenever the manifold is curved.
For a Gaussian with fixed variance, e- and m-geodesics coincide: interpolation in theta-space and in eta-space gives the same result. The distinction appears once you mix variances or move to other families.
The m-geodesic between p_0 and p_1 is:
The Pythagorean Theorem for KL Divergence
The central theorem of information geometry: for three points P_1, P_2, P_3 on a statistical manifold, if P_3 is the m-projection of P_2 onto an e-flat submanifold containing P_1, then the KL divergences satisfy a Pythagorean identity.
Numerical example: P_2 = N(1, 2), e-flat manifold Q = {N(mu, 1): mu in R} (all normals with sigma=1). The m-projection of P_2 onto Q: find N(mu*, 1) minimizing KL(P_2 || N(mu*, 1)) = N(1, 1). Check: KL(N(1,2) || N(1,1)) + KL(N(1,1) || N(0,1)) = KL(N(1,2) || N(0,1)).
The KL Pythagorean theorem explains why projecting onto a manifold reduces total KL: the orthogonal components separate. This underlies variational inference: minimizing KL(q||p) over q in a restricted class is exactly an m-projection.
In variational inference we minimize KL(q(z) || p(z|x)) over q in the class of factorized distributions. This is:
EM Algorithm: Geometry of Alternating Projections
The EM algorithm for Gaussian mixtures converges - but why? The classical proof uses the ELBO lower bound. The geometric proof: the E-step and M-step are alternating m- and e-projections, and the Pythagorean theorem guarantees a monotone decrease in KL.
E-step: fix theta, update the distribution over latents q(z) = p(z|x, theta). This is an m-projection of the joint p(x, z; theta) onto the manifold consistent with the observed x. M-step: fix q(z), maximize the expected log-likelihood. This is an e-projection onto the manifold of admissible theta.
Monotonicity of EM: the Pythagorean theorem guarantees KL(q || p_new) <= KL(q || p_old) after the M-step. Convergence is only to a local maximum, not the global one. Different initializations give different solutions - this is not a bug of the algorithm, it is the geometry of a non-convex manifold.
Why does the log-likelihood rise monotonically at every EM iteration?
Итоги
- e-geodesics: straight lines in natural parameters $\theta$; exponential family $p(x;\theta) = \exp(\theta^\top T(x) - A(\theta))$
- m-geodesics: straight lines in moments $\eta = \nabla A(\theta)$; mixtures of distributions
- Pythagorean theorem: $\mathrm{KL}(P_2 \| P_1) = \mathrm{KL}(P_2 \| P_3) + \mathrm{KL}(P_3 \| P_1)$ when $P_3$ is the m-projection of $P_2$ onto the e-flat submanifold containing $P_1$
- EM as alternating projections: E-step (m-projection), M-step (e-projection); monotonicity follows from the Pythagorean theorem
Related topics
Dual geometry unifies optimization and statistics:
- Dually flat manifolds — Foundation of e/m structure
- Information projections — e- and m-projections as the core operations
- Natural gradient — Natural gradient is the gradient in e-geometry
Вопросы для размышления
- In a VAE we minimize KL(q(z|x) || p(z|x)). Is this an e-projection or an m-projection? What happens if you swap the order of the KL arguments?
- EM provably converges to a local maximum of the likelihood. How does the KL Pythagorean theorem explain this monotonicity? When does monotonicity break?
- Mean-field assumes the factorization q(z) = prod q_i(z_i). Does this restriction define an e-flat or m-flat submanifold?
Связанные уроки
- ig-05-dual-flat — Dually flat structure is the foundation for e/m-geodesics
- ig-08-info-projection — Projection onto a submanifold is the key operation
- ig-14-neural — NTK is a metric on function space - similar duality