Information Geometry

Alpha-Divergences and Dual Geometry

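Variational inference in the VAE (Kingma & Welling, 2013) minimizes the KL divergence KL(q||p), the alpha->1 limit of the alpha-divergence D_alpha(q||p), where q is the variational distribution and p is the posterior. The choice of alpha determines mass-covering versus mode-seeking behavior: alpha=-1 gives KL(p||q), which spreads q over all modes of the posterior, while alpha=1 gives KL(q||p), which tends to lock q onto a single mode.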

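VAE (Kingma & Welling, 2013): ELBO = E_q[log p(x|z)] - KL(q(z|x)||p(z)) = log p(x) - KL(q(z|x)||p(z|x)); maximizing it minimizes the alpha=1 divergence to the posterior
Power EP (Minka, 2004): minimizes alpha-divergences for a chosen alpha; standard EP is the alpha=-1 case, KL(p||q), which gives mass-covering approximations of multi-modal posteriors
Renyi alpha-divergence (Li & Turner, 2016): generalized ELBO for Renyi order alpha != 1; orders below 1 give better posterior coverage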
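
The mass-covering versus mode-seeking contrast can be checked numerically. The sketch below is a minimal illustration, not taken from the cited papers: it fits a single Gaussian q to a bimodal target p on a grid by brute-force search, once under KL(q||p) and once under KL(p||q). The target (modes at -3 and 3), the grid, and the helper names log_normal, kl, and best_gaussian are all assumptions chosen for the example.

```python
import numpy as np

# Integration grid (assumption: wide enough to hold nearly all of the mass).
x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2), computed analytically to avoid log(0)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Hypothetical bimodal target: equal mixture of N(-3, 0.7^2) and N(3, 0.7^2).
log_p = np.logaddexp(log_normal(x, -3.0, 0.7), log_normal(x, 3.0, 0.7)) + np.log(0.5)

def kl(log_a, log_b):
    """Riemann-sum approximation of KL(a||b) on the grid."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * dx)

def best_gaussian(objective):
    """Brute-force search over (mu, sigma) for the Gaussian minimizing objective."""
    best = None
    for mu in np.linspace(-4.0, 4.0, 81):
        for sigma in np.linspace(0.3, 4.0, 75):
            value = objective(log_normal(x, mu, sigma))
            if best is None or value < best[0]:
                best = (value, mu, sigma)
    return best

# alpha = 1 in the lesson's convention: KL(q||p), mode-seeking.
_, mu_excl, sigma_excl = best_gaussian(lambda log_q: kl(log_q, log_p))
# alpha = -1 in the lesson's convention: KL(p||q), mass-covering.
_, mu_incl, sigma_incl = best_gaussian(lambda log_q: kl(log_p, log_q))

print("KL(q||p) fit: mu=%.2f sigma=%.2f" % (mu_excl, sigma_excl))
print("KL(p||q) fit: mu=%.2f sigma=%.2f" % (mu_incl, sigma_incl))
```

Under these assumptions the KL(q||p) fit sits on one of the two modes with a narrow width, while the KL(p||q) fit is centered between the modes with a width that spans both.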

Alpha-Divergences: A Unified Family

The autoencoder in Stable Diffusion 3 (2024) is trained with a KL regularization term, the alpha->1 special case of the alpha-divergence D_alpha(q||p). Minimizing KL(q||p) pushes the variational distribution q toward a single mode of a multi-modal target. Decrease alpha to -1 and the objective becomes KL(p||q), whose minimizer covers all modes of the posterior. These are not different methods but one family: alpha-divergences unify KL, reverse KL, Hellinger, and chi-squared (alpha = ±3, up to a factor of 1/2) into a single geometric structure.
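
To make the "one family" claim concrete, here is a minimal sketch for discrete distributions, using the alpha-divergence formula from the summary of this lesson. The function name alpha_divergence and the two example distributions are assumptions made for illustration; the special members are compared against their textbook definitions.

```python
import numpy as np

def alpha_divergence(a, b, alpha):
    """D_alpha(a||b) = 4/(1-alpha^2) * (1 - sum a^((1+alpha)/2) * b^((1-alpha)/2))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if alpha == 1.0:       # limit alpha -> 1: KL(a||b)
        return float(np.sum(a * np.log(a / b)))
    if alpha == -1.0:      # limit alpha -> -1: KL(b||a)
        return float(np.sum(b * np.log(b / a)))
    overlap = np.sum(a ** ((1.0 + alpha) / 2.0) * b ** ((1.0 - alpha) / 2.0))
    return float(4.0 / (1.0 - alpha ** 2) * (1.0 - overlap))

# Hypothetical example distributions on three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.5, 0.3])

kl_pq = alpha_divergence(p, q, 1.0)
kl_qp = alpha_divergence(p, q, -1.0)
near_plus_one = alpha_divergence(p, q, 0.999)     # should approach KL(p||q)
near_minus_one = alpha_divergence(p, q, -0.999)   # should approach KL(q||p)

hellinger_sq = float(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
at_zero = alpha_divergence(p, q, 0.0)             # equals 2 * hellinger_sq

pearson_chi2 = float(np.sum((p - q) ** 2 / q))
at_three = alpha_divergence(p, q, 3.0)            # equals pearson_chi2 / 2

print(kl_pq, near_plus_one)
print(kl_qp, near_minus_one)
print(at_zero, 2.0 * hellinger_sq)
print(at_three, 0.5 * pearson_chi2)
```

The two KL branches are handled separately because the general formula has a removable singularity at alpha = ±1.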

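The Renyi divergence R_alpha(p||q) = 1/(alpha-1) * log integral(p^alpha * q^{1-alpha}) dx is closely related to the alpha-divergence: a Renyi divergence of order rho is a function of the same integral as the alpha-divergence with alpha = 2*rho - 1. In the limit of order 1, R_1 is the KL divergence. The VAE ELBO equals log p(x) - R_1(q||p(z|x)), so it is a lower bound on log p(x) whose gap is R_1; the Renyi ELBO (Li & Turner, 2016) replaces R_1 with R_alpha for arbitrary alpha.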
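
The relation between the ELBO and its Renyi generalization can be sketched on a toy discrete model. Everything below is an assumption for illustration (the joint table, the variational distribution q, and the names renyi_divergence and renyi_bound); the bound is written in the form 1/(1-alpha) * log E_q[(p(x,z)/q(z))^(1-alpha)] used by Li & Turner (2016).

```python
import numpy as np

# Hypothetical toy model: joint[z] = p(x, z) for one fixed observation x.
joint = np.array([0.10, 0.05, 0.15])
log_evidence = float(np.log(joint.sum()))   # log p(x)
posterior = joint / joint.sum()             # p(z|x)
q = np.array([0.5, 0.2, 0.3])               # variational distribution q(z)

def renyi_divergence(a, b, order):
    """R_order(a||b) = 1/(order-1) * log sum a^order * b^(1-order); KL at order 1."""
    if order == 1.0:
        return float(np.sum(a * np.log(a / b)))
    return float(np.log(np.sum(a ** order * b ** (1.0 - order))) / (order - 1.0))

def renyi_bound(order):
    """Renyi bound of the given order; order 1 is the ordinary ELBO."""
    ratio = joint / q
    if order == 1.0:
        return float(np.sum(q * np.log(ratio)))
    return float(np.log(np.sum(q * ratio ** (1.0 - order))) / (1.0 - order))

orders = (0.0, 0.5, 1.0)
bounds = {order: renyi_bound(order) for order in orders}
# The gap to log p(x) should equal the Renyi divergence from q to the posterior.
gaps = {order: log_evidence - bounds[order] for order in orders}

for order in orders:
    print(order, bounds[order], gaps[order], renyi_divergence(q, posterior, order))
```

In this sketch the order-1 bound is the standard ELBO, and lowering the order tightens the bound until it reaches log p(x) at order 0 (which requires q to be nonzero wherever the posterior is).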

Why does variational inference with KL(q||p) (alpha=1) lead to mode-seeking behavior?

Dual Flatness and the Pythagorean Theorem

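In Euclidean space the Pythagorean theorem states |AC|^2 = |AB|^2 + |BC|^2 when the angle at B is a right angle. On a statistical manifold the analogue for the KL divergence D is: D(p||r) = D(p||q) + D(q||r) when the m-geodesic from p to q is orthogonal at q to the e-geodesic from q to r. In particular, the identity holds when q is the m-projection of p onto an e-flat submanifold that contains r, and also when q is the e-projection of r onto an m-flat submanifold that contains p. This is the geometric foundation of the EM algorithm.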
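
A small numerical check of the identity, as a sketch under stated assumptions: the e-flat submanifold is taken to be the family of product (independent) distributions on a 2x2 table, for which the m-projection of a joint p is the product of its marginals. The specific tables p and r and the helper name kl are made up for the example.

```python
import numpy as np

def kl(a, b):
    """KL(a||b) for strictly positive discrete distributions of the same shape."""
    return float(np.sum(a * np.log(a / b)))

# Hypothetical joint distribution of two dependent binary variables.
p = np.array([[0.30, 0.20],
              [0.10, 0.40]])

# m-projection of p onto the product family: the product of its marginals.
q_star = np.outer(p.sum(axis=1), p.sum(axis=0))

# Any other member of the (e-flat) product family.
r = np.outer([0.6, 0.4], [0.25, 0.75])

lhs = kl(p, r)
rhs = kl(p, q_star) + kl(q_star, r)

print(lhs, rhs)
```

The two printed numbers agree up to floating-point error, and since kl(q_star, r) is non-negative the identity also shows that q_star is the closest product distribution to p.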

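The Pythagorean theorem in information geometry explains EM convergence: the E-step and the M-step are alternating projections between the data manifold and the model manifold, so each pair of steps never increases the KL divergence between them, which is equivalent to never decreasing the log-likelihood.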
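
The monotone behavior can be observed directly. The following is a minimal sketch, assuming a two-component 1D Gaussian mixture with known unit variances and synthetic data; the log-likelihood is tracked because it differs from the KL objective only by a term that does not depend on the parameters. All names (log_components, log_likelihood, history) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from two unit-variance Gaussians (assumption for the demo).
data = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 100)])

def log_components(x, means, weights):
    """log(weight_k) + log N(x_n | mean_k, 1) for every point n and component k."""
    return (np.log(weights)
            - 0.5 * (x[:, None] - means) ** 2
            - 0.5 * np.log(2.0 * np.pi))

def log_likelihood(x, means, weights):
    return float(np.sum(np.logaddexp.reduce(log_components(x, means, weights), axis=1)))

means = np.array([-0.5, 0.5])
weights = np.array([0.5, 0.5])
history = [log_likelihood(data, means, weights)]

for _ in range(30):
    # E-step: posterior responsibilities q(z|x) under the current parameters.
    log_joint = log_components(data, means, weights)
    resp = np.exp(log_joint - np.logaddexp.reduce(log_joint, axis=1, keepdims=True))
    # M-step: maximize the ELBO over means and mixing weights.
    counts = resp.sum(axis=0)
    means = (resp * data[:, None]).sum(axis=0) / counts
    weights = counts / len(data)
    history.append(log_likelihood(data, means, weights))

print(means, weights)
print(np.diff(history).min())  # smallest per-iteration change in log-likelihood
```

The variances are held fixed only to keep the sketch short; the same monotonicity argument applies when they are updated in the M-step.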

In the EM algorithm the E-step computes q(z|x) and the M-step maximizes the ELBO. What is the information-geometric interpretation?

Bregman Divergences and Exponential Families

KL divergence is a Bregman divergence generated by the negative entropy F(p) = sum(p log p). This is not a coincidence: for any exponential family with natural parameter theta, KL equals the Bregman divergence of the log-partition function A(theta) with the arguments swapped, KL(p_theta1||p_theta2) = B_A(theta2||theta1). This gives a unified view of k-means, EM, logistic regression, and variational inference.
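
As a worked check of the exponential-family statement, the sketch below uses the Poisson family, where theta = log(rate) and A(theta) = exp(theta). The rates 3 and 5, the truncation of the sum at k = 200, and the function names are assumptions made for illustration.

```python
import numpy as np
from math import lgamma

def log_partition(theta):
    """Poisson log-partition function A(theta) = exp(theta), theta = log(rate)."""
    return np.exp(theta)

def bregman_log_partition(theta_a, theta_b):
    """B_A(theta_a||theta_b) = A(a) - A(b) - A'(b) * (a - b); A' = exp here."""
    return float(log_partition(theta_a) - log_partition(theta_b)
                 - np.exp(theta_b) * (theta_a - theta_b))

def poisson_kl(rate_1, rate_2, k_max=200):
    """KL(Poisson(rate_1)||Poisson(rate_2)) by direct summation over k = 0..k_max."""
    k = np.arange(k_max + 1)
    log_factorial = np.array([lgamma(i + 1.0) for i in k])
    log_pmf_1 = k * np.log(rate_1) - rate_1 - log_factorial
    log_pmf_2 = k * np.log(rate_2) - rate_2 - log_factorial
    return float(np.sum(np.exp(log_pmf_1) * (log_pmf_1 - log_pmf_2)))

rate_1, rate_2 = 3.0, 5.0
kl_direct = poisson_kl(rate_1, rate_2)
# Note the swapped argument order: KL(p_1||p_2) = B_A(theta_2||theta_1).
kl_bregman = bregman_log_partition(np.log(rate_2), np.log(rate_1))
kl_closed_form = rate_1 * np.log(rate_1 / rate_2) - rate_1 + rate_2

print(kl_direct, kl_bregman, kl_closed_form)
```

All three printed values agree, which is the Poisson instance of the general identity.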

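k-means is hard clustering under the Bregman divergence generated by F(x) = |x|^2 / 2, which is half the squared Euclidean distance. Soft k-means with KL as the distortion is EM for a mixture of the matching exponential family (multinomial components), with responsibilities given by Boltzmann (softmax) weights. One mathematical structure, many applications.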

The Bregman divergence is B_F(y||x) = F(y) - F(x) - nabla F(x)^T (y - x). With F(eta) = sum(eta_i log eta_i) (negative entropy), what does one get?
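
One way to explore the question is to evaluate the definition for a couple of generators. The sketch below is illustrative only: the function bregman, the generator helpers, and the two example vectors are assumptions, and the gradients are supplied analytically.

```python
import numpy as np

def bregman(F, grad_F, y, x):
    """B_F(y||x) = F(y) - F(x) - grad F(x)^T (y - x)."""
    return float(F(y) - F(x) - grad_F(x) @ (y - x))

def half_sq_norm(v):
    return 0.5 * float(v @ v)

def grad_half_sq_norm(v):
    return v

def neg_entropy(v):
    return float(np.sum(v * np.log(v)))

def grad_neg_entropy(v):
    return np.log(v) + 1.0

# Two hypothetical probability vectors (both sum to 1).
y = np.array([0.5, 0.3, 0.2])
x = np.array([0.2, 0.5, 0.3])

from_sq_norm = bregman(half_sq_norm, grad_half_sq_norm, y, x)
from_neg_entropy = bregman(neg_entropy, grad_neg_entropy, y, x)

half_sq_distance = 0.5 * float(np.sum((y - x) ** 2))
kl_yx = float(np.sum(y * np.log(y / x)))

print(from_sq_norm, half_sq_distance)
print(from_neg_entropy, kl_yx)
```

For vectors that are not normalized, the negative-entropy case picks up the extra term sum(x) - sum(y), which vanishes on the probability simplex.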

Summary

Alpha-divergence: D_alpha(p||q) = 4/(1-alpha^2) [1 - integral p^{(1+alpha)/2} q^{(1-alpha)/2} dx]
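Limits: alpha->1 gives KL(p||q), alpha->-1 gives KL(q||p), alpha=0 gives 2*Hellinger^2 with Hellinger^2 = integral (sqrt(p) - sqrt(q))^2 dx
Pythagorean theorem: D(p||r) = D(p||q*) + D(q*||r) when q* is the m-projection of p onto an e-flat submanifold containing r, or the e-projection of r onto an m-flat submanifold containing p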
KL as Bregman divergence: KL(q||p) = B_F(eta_q||eta_p) where F = negative entropy

Questions for Reflection

Why does variational inference with KL(q||p) at alpha=1 lead to mode-seeking, while alpha=-1 gives mass-covering?
How does the Pythagorean theorem in information geometry underlie the EM algorithm?
What does it mean for KL divergence to be a Bregman divergence?

Related Lessons

ig-04-kl-bregman — ig-19 generalizes KL to the full alpha-divergence family
ig-05-dual-flat — Dual flatness underlies the geometry of alpha-divergences
