Information Geometry
Alpha-Divergences and Dual Geometry
Variational inference in the VAE (Kingma & Welling, 2013) minimizes a KL divergence, which is the alpha->1 member of the alpha-divergence family D_alpha(q||p), with the approximation q in the first argument. The choice of alpha determines mass-covering versus mode-seeking behavior: at alpha=-1 the approximation spreads over all modes of the posterior, while at alpha=1 it tends to lock onto a single mode.
- VAE (Kingma & Welling, 2013): ELBO = E_q[log p(x|z)] - KL(q(z|x)||p(z)); the KL term is the alpha=1 divergence
- Power EP (Minka, 2004): alpha=-1 for robust multi-modal posterior approximation
- Renyi alpha-divergence (Li & Turner, 2016): generalized ELBO for alpha != 1 with better posterior coverage
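The mode-seeking versus mass-covering contrast can be seen in a small numerical sketch. Everything here is an illustrative assumption rather than part of any cited method: the bimodal target, the single-Gaussian family for q, and the brute-force grid search that stands in for a real optimizer.

```python
import numpy as np

# Grid over z; densities are evaluated on this grid and integrated by a Riemann sum.
z = np.linspace(-20.0, 20.0, 2001)
dz = z[1] - z[0]

def gaussian(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl(a, b):
    """KL(a || b) for densities sampled on the grid."""
    eps = 1e-300  # guards log(0) where a density underflows; an implementation choice
    return float(np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dz)

# Hypothetical bimodal target: equal mixture of N(-4, 1) and N(4, 1).
p = 0.5 * gaussian(z, -4.0, 1.0) + 0.5 * gaussian(z, 4.0, 1.0)

def best_fit(objective):
    """Brute-force search over (mu, sigma) for a single Gaussian q."""
    best = None
    for mu in np.linspace(-5.0, 5.0, 41):
        for sigma in np.linspace(0.5, 5.0, 46):
            value = objective(gaussian(z, mu, sigma))
            if best is None or value < best[0]:
                best = (value, mu, sigma)
    return best[1], best[2]

# KL(q || p): the alpha = 1 case in this lesson's convention (mode-seeking).
mu_rev, sigma_rev = best_fit(lambda q: kl(q, p))
# KL(p || q): the alpha = -1 case in this lesson's convention (mass-covering).
mu_fwd, sigma_fwd = best_fit(lambda q: kl(p, q))

print("KL(q||p) fit: mu=%.2f sigma=%.2f" % (mu_rev, sigma_rev))
print("KL(p||q) fit: mu=%.2f sigma=%.2f" % (mu_fwd, sigma_fwd))
```

Under these assumptions the KL(q||p) fit should sit on one of the two modes with a width close to 1, while the KL(p||q) fit should center between the modes with a width close to sqrt(17), the standard deviation of the whole mixture.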
Alpha-Divergences: A Unified Family
The variational autoencoder used in Stable Diffusion 3 (2024, up to 8B parameters in the full model) is trained with a KL loss, which is the α→1 member of the α-divergence family. A VAE minimizes KL(q||p), which pushes the variational distribution toward a single mode. Decrease alpha to -1 and the optimizer instead spreads q over all modes of the posterior. These are not different methods but one family: alpha-divergences unify KL, reverse KL, Hellinger, and chi-squared divergences in a single geometric structure.
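A short numerical check of how the family collapses to its named members, assuming the Amari form D_alpha(p||q) = 4/(1-alpha^2) [1 - sum p^{(1+alpha)/2} q^{(1-alpha)/2}] on discrete distributions. The two example distributions and the function names are illustrative choices.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence between discrete distributions (alpha != +1, -1)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    integral = np.sum(p ** ((1.0 + alpha) / 2.0) * q ** ((1.0 - alpha) / 2.0))
    return float(4.0 / (1.0 - alpha ** 2) * (1.0 - integral))

def kl(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Illustrative distributions on three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

near_plus_one = alpha_divergence(p, q, 1.0 - 1e-6)    # approaches KL(p || q)
near_minus_one = alpha_divergence(p, q, -1.0 + 1e-6)  # approaches KL(q || p)
at_zero = alpha_divergence(p, q, 0.0)
# Unnormalized squared Hellinger distance: sum (sqrt(p) - sqrt(q))^2.
hellinger_sq = float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(near_plus_one, kl(p, q))
print(near_minus_one, kl(q, p))
print(at_zero, 2.0 * hellinger_sq)
```

The limits at alpha = +1 and -1 are taken numerically because the closed form has a removable singularity there.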
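The Renyi divergence R_alpha(p||q) = 1/(alpha-1) * log integral(p^alpha * q^{1-alpha} dx) is closely related to the alpha-divergence: for a fixed order, both are monotone functions of the same integral, with Renyi order alpha corresponding to the Amari parameter 2*alpha-1. The VAE ELBO equals log p(x) - R_1(q||p(z|x)), where R_1 is the alpha->1 limit (the KL divergence), so it is a lower bound on log p(x); the Renyi ELBO (Li & Turner, 2016) generalizes this to other values of alpha.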
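The relation between the two families can be checked numerically. The sketch below assumes discrete distributions and the Amari form D_a(p||q) = 4/(1-a^2) [1 - sum p^{(1+a)/2} q^{(1-a)/2}]; the conversion a = 2*alpha - 1 follows from matching the exponents of p and q in the two integrals.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """R_alpha(p || q) = log(sum p^alpha q^(1-alpha)) / (alpha - 1), alpha != 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / (alpha - 1.0))

# Illustrative distributions on three outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.3, 0.4, 0.3])

# The alpha -> 1 limit recovers KL(p || q).
kl_pq = float(np.sum(p * np.log(p / q)))
renyi_near_one = renyi_divergence(p, q, 1.0 - 1e-6)

# Same integral, two parameterizations: Renyi order 0.75 <-> Amari parameter 0.5.
alpha_r = 0.75
alpha_a = 2.0 * alpha_r - 1.0
integral = np.sum(p ** alpha_r * q ** (1.0 - alpha_r))
amari_direct = 4.0 / (1.0 - alpha_a ** 2) * (1.0 - integral)
r = renyi_divergence(p, q, alpha_r)
amari_from_renyi = (1.0 - np.exp((alpha_r - 1.0) * r)) / (alpha_r * (1.0 - alpha_r))

print(renyi_near_one, kl_pq)
print(amari_direct, amari_from_renyi)
```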
Why does variational inference with KL(q||p) (alpha=1) lead to mode-seeking behavior?
Dual Flatness and the Pythagorean Theorem
In Euclidean space the Pythagorean theorem states |AC|^2 = |AB|^2 + |BC|^2 when the angle at B is a right angle. On a statistical manifold the analogue is D(p||r) = D(p||q) + D(q||r), which holds when q is the m-projection of p onto an e-flat submanifold that contains r (equivalently, the m-geodesic from p to q meets the e-geodesic from q to r orthogonally at q). The dual statement swaps e and m and reverses the arguments of D. This is the geometric foundation of the EM algorithm.
The Pythagorean theorem in information geometry explains EM convergence: the E-step and M-step act as alternating projections between the data manifold and the model manifold, and each pair of steps never increases the KL divergence between them (equivalently, the likelihood never decreases).
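One concrete instance that can be verified by hand or in code: the product (independent) distributions on two binary variables form an e-flat family, and the m-projection of a joint distribution p onto that family is the product of its marginals. The numbers below are arbitrary illustrative choices.

```python
import numpy as np

def kl(a, b):
    """KL(a || b) for strictly positive arrays of the same shape."""
    return float(np.sum(a * np.log(a / b)))

# A correlated joint distribution over two binary variables.
p = np.array([[0.4, 0.1],
              [0.2, 0.3]])

# m-projection of p onto the product family: the product of its marginals.
q_star = np.outer(p.sum(axis=1), p.sum(axis=0))

# Any other member of the same e-flat (product) family.
r = np.outer([0.7, 0.3], [0.2, 0.8])

lhs = kl(p, r)
rhs = kl(p, q_star) + kl(q_star, r)
print(lhs, rhs)  # the two values agree up to floating-point error
```

Here KL(p||q_star) is the mutual information between the two variables, so the identity splits the divergence from p to r into a "dependence" part and a "marginal mismatch" part.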
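The monotone behavior can be observed in a minimal EM run. This is a sketch under simplifying assumptions chosen for brevity (a 1-D mixture of two unit-variance Gaussians, synthetic data, a fixed number of iterations). Maximizing the log-likelihood is the same as minimizing the KL divergence from the empirical distribution to the model, up to a constant.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two unit-variance Gaussians (illustrative choice).
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

def log_likelihood(x, w, mu):
    comp = w * np.exp(-0.5 * (x[:, None] - mu) ** 2) / np.sqrt(2.0 * np.pi)
    return float(np.sum(np.log(comp.sum(axis=1))))

w = np.array([0.5, 0.5])    # mixing weights
mu = np.array([-1.0, 1.0])  # component means (variances fixed at 1)
history = [log_likelihood(x, w, mu)]

for _ in range(30):
    # E-step: posterior responsibilities q(z | x) under the current parameters.
    comp = w * np.exp(-0.5 * (x[:, None] - mu) ** 2)
    resp = comp / comp.sum(axis=1, keepdims=True)
    # M-step: maximize the expected complete-data log-likelihood.
    nk = resp.sum(axis=0)
    w = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    history.append(log_likelihood(x, w, mu))

print("log-likelihood per iteration:", np.round(history[:5], 3), "...")
print("estimated means:", np.round(mu, 2), "weights:", np.round(w, 2))
```

The recorded log-likelihood values should form a non-decreasing sequence, which is the same monotonicity expressed as a non-increasing KL divergence.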
In the EM algorithm the E-step computes q(z|x) and the M-step maximizes the ELBO. What is the information-geometric interpretation?
Bregman Divergences and Exponential Families
KL divergence is a Bregman divergence generated by the negative entropy F(p) = sum(p log p). This is not a coincidence: for any exponential family, KL(p_theta1||p_theta2) equals the Bregman divergence B_A(theta2||theta1) of the log-partition function A(theta), with the arguments swapped. This gives a unified view of k-means, EM, logistic regression, and variational inference.
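The exponential-family statement can be checked on the simplest case, the Bernoulli family with natural parameter theta and log-partition function A(theta) = log(1 + e^theta). The parameter values and function names below are illustrative.

```python
import numpy as np

def log_partition(theta):
    """A(theta) = log(1 + exp(theta)) for the Bernoulli family."""
    return float(np.log1p(np.exp(theta)))

def mean_param(theta):
    """A'(theta): the mean parameter, i.e. the sigmoid of theta."""
    return float(1.0 / (1.0 + np.exp(-theta)))

def bregman_A(theta_a, theta_b):
    """B_A(theta_a || theta_b) = A(a) - A(b) - A'(b) * (a - b)."""
    return (log_partition(theta_a) - log_partition(theta_b)
            - mean_param(theta_b) * (theta_a - theta_b))

def kl_bernoulli(m1, m2):
    return float(m1 * np.log(m1 / m2) + (1.0 - m1) * np.log((1.0 - m1) / (1.0 - m2)))

theta1, theta2 = 0.8, -1.3
kl_12 = kl_bernoulli(mean_param(theta1), mean_param(theta2))
breg_21 = bregman_A(theta2, theta1)  # note the swapped argument order
print(kl_12, breg_21)
```

The swap in argument order is the point of the check: KL(p_theta1||p_theta2) matches B_A(theta2||theta1), not B_A(theta1||theta2).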
k-means uses the Bregman divergence generated by F(x) = ||x||^2 / 2, which is half the squared Euclidean distance. Soft k-means with KL is EM for a Boltzmann mixture. One mathematical structure, many applications.
The Bregman divergence is B_F(y||x) = F(y) - F(x) - nabla F(x)^T (y - x). With F(eta) = sum(eta_i log eta_i) (negative entropy), what does one get?
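A generic Bregman divergence takes only a generator F and its gradient, so the same assignment step serves both the Euclidean and the KL case. The helper names, example vectors, and centroids below are hypothetical and exist only for illustration.

```python
import numpy as np

def bregman(F, grad_F, y, x):
    """B_F(y || x) = F(y) - F(x) - <grad F(x), y - x>."""
    y, x = np.asarray(y, dtype=float), np.asarray(x, dtype=float)
    return float(F(y) - F(x) - grad_F(x) @ (y - x))

def half_sq_norm(v):
    return 0.5 * float(v @ v)

def grad_half_sq_norm(v):
    return v

def neg_entropy(v):
    return float(np.sum(v * np.log(v)))

def grad_neg_entropy(v):
    return np.log(v) + 1.0

y = np.array([0.7, 0.2, 0.1])
x = np.array([0.3, 0.4, 0.3])

d_euclid = bregman(half_sq_norm, grad_half_sq_norm, y, x)  # half squared distance
d_kl = bregman(neg_entropy, grad_neg_entropy, y, x)        # KL(y || x) on the simplex

# k-means-style hard assignment: pick the centroid with the smallest divergence.
centroids = np.array([[0.6, 0.3, 0.1],
                      [0.2, 0.3, 0.5]])
nearest = min(range(len(centroids)),
              key=lambda k: bregman(neg_entropy, grad_neg_entropy, y, centroids[k]))
print(d_euclid, d_kl, nearest)
```

Swapping the generator changes the geometry of the clusters while the assignment loop stays the same.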
Summary
- Alpha-divergence: D_alpha(p||q) = 4/(1-alpha^2) [1 - integral p^{(1+alpha)/2} q^{(1-alpha)/2} dx]
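- Limits: alpha->1 gives KL(p||q), alpha->-1 gives KL(q||p), alpha=0 gives 2*Hellinger^2 with Hellinger^2 = integral (sqrt(p) - sqrt(q))^2 dx (the variational sections apply the family as D_alpha(q||p), so alpha=1 there is KL(q||p))
- Pythagorean theorem: D(p||r) = D(p||q*) + D(q*||r) when q* is the m-projection of p onto an e-flat submanifold containing r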
- KL as Bregman divergence: KL(q||p) = B_F(eta_q||eta_p) where F = negative entropy
Questions for Reflection
- Why does variational inference with KL(q||p) at alpha=1 lead to mode-seeking, while alpha=-1 gives mass-covering?
- How does the Pythagorean theorem in information geometry underlie the EM algorithm?
- What does it mean for KL divergence to be a Bregman divergence?
Related Lessons
- ig-04-kl-bregman — ig-19 generalizes KL to the full alpha-divergence family
- ig-05-dual-flat — Dual flatness underlies the geometry of alpha-divergences