Information Geometry

Alpha-Divergences and Generalized Geometry

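VAEs minimize KL(q||p) and produce blurry generations. GANs, in their original form, minimize the Jensen-Shannon divergence and produce sharp but unstable ones. f-VAE, alpha-divergence VI, and Wasserstein autoencoders are all attempts to find the right divergence. Amari showed that a single parametric family, the alpha-divergences, contains both directions of KL and the Hellinger distance as special cases, so choosing between them reduces to choosing one number. Jensen-Shannon and Wasserstein lie outside this family.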

  • VAE: the variational term is reverse KL, KL(q||p), which is alpha = -1 in the convention used below and is mode-seeking; beta-VAE reweights the same KL regularizer
  • Renyi-alpha divergence in privacy: differential privacy uses Renyi divergence to analyze information leakage
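  • Alpha-divergence VI (Li & Turner 2016): written in terms of the Renyi order, alpha -> 1 recovers standard VI (the VAE bound) and alpha = 0 recovers importance sampling of log p(x); alpha = 0.5 gives intermediate behavior between the two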

Amari's Family of Alpha-Divergences

KL(P||Q) and KL(Q||P) are different divergences with different properties. Amari combined them into a single family indexed by a real parameter alpha. In the limit alpha -> 1 the family gives forward KL, KL(P||Q). In the limit alpha -> -1 it gives reverse KL, KL(Q||P). At alpha = 0 it gives a symmetric divergence.

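Alpha = 0 corresponds to the squared Hellinger distance up to a constant factor: D^(0) = 4*(1 - integral sqrt(p*q)) = 2 * integral (sqrt(p) - sqrt(q))^2. It is the only symmetric alpha-divergence: D^(0)(P||Q) = D^(0)(Q||P). For every other alpha, swapping the arguments flips the sign of alpha: D^(alpha)(P||Q) = D^(-alpha)(Q||P).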
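The limits and the symmetry at alpha = 0 can be checked numerically. The sketch below is a minimal illustration on two made-up discrete distributions, using the formula D^(alpha)(P||Q) = 4/(1 - alpha^2) * (1 - sum p^((1+alpha)/2) q^((1-alpha)/2)) from the summary of this lesson. The function names and the example vectors are assumptions chosen for illustration.

```python
import math

def alpha_divergence(p, q, alpha):
    """Amari alpha-divergence for discrete distributions, valid for alpha != +1, -1."""
    overlap = sum(pi ** ((1 + alpha) / 2) * qi ** ((1 - alpha) / 2)
                  for pi, qi in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - overlap)

def kl(p, q):
    """KL(p || q) for strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical example distributions (not from the lesson).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

near_plus_one = alpha_divergence(p, q, 1 - 1e-5)    # approaches forward KL(p || q)
near_minus_one = alpha_divergence(p, q, -1 + 1e-5)  # approaches reverse KL(q || p)
d0_pq = alpha_divergence(p, q, 0.0)
d0_qp = alpha_divergence(q, p, 0.0)
# Unnormalized squared Hellinger distance: sum (sqrt(p) - sqrt(q))^2
hellinger_sq = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

print("alpha -> +1:", near_plus_one, "KL(p||q):", kl(p, q))
print("alpha -> -1:", near_minus_one, "KL(q||p):", kl(q, p))
print("alpha = 0:", d0_pq, d0_qp, "2 * Hellinger^2:", 2 * hellinger_sq)
```

The values at alpha = +1 and alpha = -1 are obtained as limits because the prefactor 4/(1 - alpha^2) is undefined exactly at those points.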

D^(alpha = 1)(P||Q) = ?

Mode-Seeking vs Mode-Covering: VAE in Practice

The main practical distinction: KL(q||p) is mode-seeking (alpha = -1), KL(p||q) is mode-covering (alpha = +1). If p has two modes and q is unimodal, minimizing KL(q||p) selects one of the modes. Minimizing KL(p||q) stretches q to cover both modes, at the cost of placing probability mass in the low-density region between them.
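The two behaviors are easy to reproduce on a grid. The sketch below is an illustration under assumed values: the target p is a mixture of two unit-variance Gaussians at -3 and +3, q is a single Gaussian, and the best (mean, std) for each direction of KL is found by brute-force search. The grid ranges and step sizes are arbitrary choices, not part of the lesson.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]

def kl(a, b):
    """KL(a || b) on a shared grid; terms with a_i == 0 contribute zero."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0.0)

# Discretize the real line and build the bimodal target p.
xs = [-10 + 0.1 * i for i in range(201)]
p = normalize([0.5 * normal_pdf(x, -3, 1) + 0.5 * normal_pdf(x, 3, 1) for x in xs])

means = [-6 + 0.25 * i for i in range(49)]   # candidate means from -6 to 6
stds = [0.5 + 0.25 * i for i in range(15)]   # candidate stds from 0.5 to 4

best_reverse = None  # minimizer of KL(q || p), the mode-seeking direction
best_forward = None  # minimizer of KL(p || q), the mode-covering direction
for mean in means:
    for std in stds:
        q = normalize([normal_pdf(x, mean, std) for x in xs])
        reverse, forward = kl(q, p), kl(p, q)
        if best_reverse is None or reverse < best_reverse[0]:
            best_reverse = (reverse, mean, std)
        if best_forward is None or forward < best_forward[0]:
            best_forward = (forward, mean, std)

print("reverse KL(q||p): mean=%.2f std=%.2f" % best_reverse[1:])
print("forward KL(p||q): mean=%.2f std=%.2f" % best_forward[1:])
```

Under these assumptions the reverse-KL fit should sit on one of the two modes with a narrow width, while the forward-KL fit should be centered between the modes with a width large enough to span both.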

Why VAEs produce blurry images: maximizing the ELBO implicitly minimizes KL(q(z|x) || p(z|x)). The ELBO itself decomposes into the expected reconstruction term E_q[log p(x|z)] minus the explicit prior-matching regularizer KL(q(z|x) || p(z)). The blur does not come from the regularizer; it comes from the reconstruction term being a Gaussian likelihood p(x|z) = N(x; decoder(z), sigma^2 I), whose maximum-likelihood solution is the conditional mean and therefore averages across data modes. Replacing the Gaussian likelihood with a more expressive one (discretized logistic, PixelCNN decoder) or with an adversarial loss recovers sharp samples, which indicates that the mode-covering behavior lives in the likelihood, not in the KL term.
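The link between the ELBO and KL(q(z|x) || p(z|x)) is the identity log p(x) = ELBO + KL(q(z|x) || p(z|x)). The sketch below checks it on a toy model with three discrete latent codes and one fixed observation x. All numbers are hypothetical and chosen only so that the sums can be computed exactly.

```python
import math

# Toy model with a discrete latent z in {0, 1, 2}; values are illustrative only.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.60, 0.30]  # p(x | z) for one fixed observation x
q = [0.2, 0.5, 0.3]              # an arbitrary approximate posterior q(z | x)

def kl(a, b):
    """KL(a || b) for strictly positive discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

# Exact evidence p(x) and exact posterior p(z | x) by Bayes' rule.
evidence = sum(pz * lx for pz, lx in zip(prior, likelihood))
posterior = [pz * lx / evidence for pz, lx in zip(prior, likelihood)]

# ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
reconstruction = sum(qz * math.log(lx) for qz, lx in zip(q, likelihood))
elbo = reconstruction - kl(q, prior)

# The gap between log p(x) and the ELBO is KL(q(z|x) || p(z|x)).
gap = kl(q, posterior)

print("log p(x):  ", math.log(evidence))
print("ELBO + gap:", elbo + gap)
```

Because log p(x) does not depend on q, raising the ELBO with respect to q is the same as shrinking the gap, which is why the minimization of KL(q(z|x) || p(z|x)) is only implicit in the objective.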

Alpha-divergence variational inference (alpha-VI) with alpha = 0 or alpha = 0.5 (Amari convention; these are Renyi orders 0.5 and 0.75) gives an intermediate result: partial mode coverage, partial concentration. For experimentation with different alpha, the tfp.vi module of TensorFlow Probability provides f-divergence generator functions such as amari_alpha.

A VAE minimizes KL(q(z|x) || p(z|x)). Is this mode-seeking or mode-covering, and what does it mean in practice?

Renyi Divergence and Differential Privacy

Renyi divergence of order alpha: R_alpha(P||Q) = 1/(alpha-1) * log(integral p^alpha * q^(1-alpha)), for alpha > 0 and alpha != 1. As alpha -> 1: R_1 = KL(P||Q). At alpha = 2: R_2 = log(1 + chi^2(P||Q)), a monotone function of the chi-squared divergence. As alpha -> inf: R_inf = sup log(p(x)/q(x)), the worst-case log-likelihood ratio.
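The three special cases can be verified on discrete distributions. The sketch below is a minimal check with made-up probability vectors; the function names are illustrative, and the order 200 stands in for the limit alpha -> inf.

```python
import math

def renyi(p, q, alpha):
    """Renyi divergence R_alpha(p || q) for discrete distributions, alpha > 0, alpha != 1."""
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def chi_squared(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

# Hypothetical example distributions.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

r_near_one = renyi(p, q, 1.0001)   # close to KL(p || q)
r_two = renyi(p, q, 2.0)           # equals log(1 + chi^2(p || q))
r_large = renyi(p, q, 200.0)       # close to the largest log-ratio
worst_case = max(math.log(pi / qi) for pi, qi in zip(p, q))

print("alpha -> 1:", r_near_one, "KL:", kl(p, q))
print("alpha = 2:", r_two, "log(1 + chi^2):", math.log(1 + chi_squared(p, q)))
print("alpha = 200:", r_large, "max log ratio:", worst_case)
```

The divergence is nondecreasing in alpha, so moving from alpha near 1 to large alpha moves from an average-case to a worst-case measure of how far p is from q.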

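Renyi-DP (Mironov 2017) is used in PyTorch Opacus and TensorFlow Privacy for training neural networks with differential privacy. Renyi divergence is more convenient than KL for DP analysis for two reasons. First, it composes additively: running an (alpha, eps_1)-RDP mechanism followed by an (alpha, eps_2)-RDP mechanism is (alpha, eps_1 + eps_2)-RDP, so the cost of many training steps is a simple sum. Second, for alpha > 1 it controls the tail of the privacy loss, so an RDP bound converts into a standard (eps, delta)-DP guarantee, whereas a bound on KL alone (the alpha -> 1 limit) only controls the average privacy loss.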
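A small accounting sketch shows how this is used. It assumes the plain Gaussian mechanism with L2 sensitivity 1, whose RDP curve is eps(alpha) = alpha / (2 sigma^2), and the conversion from (alpha, eps)-RDP to (eps + log(1/delta)/(alpha - 1), delta)-DP from Mironov (2017). It deliberately ignores subsampling, so it is not the accountant that Opacus or TensorFlow Privacy run for DP-SGD; the function names and the values of sigma, steps, and delta are hypothetical.

```python
import math

def gaussian_rdp(alpha, sigma):
    """RDP of one Gaussian-mechanism release with L2 sensitivity 1 (assumed setting)."""
    return alpha / (2.0 * sigma ** 2)

def compose(rdp_eps, steps):
    """Additive composition: the same mechanism applied `steps` times at a fixed order."""
    return steps * rdp_eps

def rdp_to_dp(alpha, rdp_eps, delta):
    """Convert an (alpha, rdp_eps)-RDP bound into an (eps, delta)-DP bound."""
    return rdp_eps + math.log(1.0 / delta) / (alpha - 1.0)

def best_epsilon(sigma, steps, delta, orders):
    """Pick the Renyi order that gives the smallest eps after composition."""
    return min(rdp_to_dp(a, compose(gaussian_rdp(a, sigma), steps), delta)
               for a in orders)

orders = [1 + 0.1 * i for i in range(1, 640)]  # candidate orders 1.1 ... 64.9
sigma, steps, delta = 4.0, 100, 1e-5           # illustrative values only

eps = best_epsilon(sigma, steps, delta, orders)
print("(eps, delta)-DP after %d steps: eps = %.3f, delta = %g" % (steps, eps, delta))
```

The search over orders matters because small alpha keeps the composed RDP term small but inflates the log(1/delta)/(alpha - 1) term, and large alpha does the opposite.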

Why is Renyi divergence more convenient than KL for analyzing differential privacy?

Alpha-Connections: Geometry Between e and m

Amari introduced a parametric family of affine connections on a statistical manifold, parameterized by alpha. At alpha = +1: the e-connection (exponential geometry). At alpha = -1: the m-connection (mixture geometry). At alpha = 0: the Levi-Civita connection (Riemannian geometry).

Duality: the alpha-connection and the (-alpha)-connection are dual with respect to the Fisher metric g: Z g(X, Y) = g(nabla^(alpha)_Z X, Y) + g(X, nabla^(-alpha)_Z Y) for all vector fields X, Y, Z. This is the definition of dual connections; the e-connection and the m-connection are the pair alpha = +1 and alpha = -1. A manifold with a pair of dual connections is called dually flat if both curvatures vanish; the standard example is the statistical manifold of an exponential family.
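In one dimension these statements reduce to identities between numbers, which makes them easy to check. The sketch below uses the Bernoulli family in its natural parameter theta and the standard coordinate expression Gamma^(alpha)_{11,1} = E[(d^2 l + (1 - alpha)/2 * (dl)^2) * dl], where l = log p(x; theta). That formula is the usual definition from Amari's work and is an addition here, not something stated in this lesson; the function names are illustrative.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fisher(theta):
    """Fisher information of Bernoulli in the natural parameter: mu * (1 - mu)."""
    mu = sigmoid(theta)
    return mu * (1.0 - mu)

def connection_coefficient(theta, alpha):
    """Gamma^(alpha)_{11,1} for Bernoulli, computed as an exact expectation over x in {0, 1}."""
    mu = sigmoid(theta)
    total = 0.0
    for x, prob in ((0, 1.0 - mu), (1, mu)):
        score = x - mu               # first derivative of log p(x; theta)
        hessian = -mu * (1.0 - mu)   # second derivative of log p(x; theta)
        total += prob * (hessian + 0.5 * (1.0 - alpha) * score ** 2) * score
    return total

theta = 0.7
h = 1e-5
# Left side of the duality relation with X = Y = Z = d/dtheta: the derivative of g.
dg = (fisher(theta + h) - fisher(theta - h)) / (2 * h)

for alpha in (1.0, 0.3, 0.0):
    pair_sum = connection_coefficient(theta, alpha) + connection_coefficient(theta, -alpha)
    print("alpha = %.1f: Gamma(alpha) + Gamma(-alpha) = %.8f, dg/dtheta = %.8f"
          % (alpha, pair_sum, dg))

# e-flatness: the alpha = +1 coefficient vanishes in natural coordinates.
print("Gamma at alpha = +1:", connection_coefficient(theta, 1.0))
```

The vanishing coefficient at alpha = +1 is the one-dimensional shadow of the statement that an exponential family is flat under the e-connection in natural coordinates.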

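In practice: alpha = 0 (Riemannian geometry) gives Amari's natural gradient, and K-FAC is a Kronecker-factored approximation of the Fisher matrix that makes this natural gradient tractable for neural networks. alpha = 1 (e-geometry) corresponds to working in natural-parameter coordinates, in which an exponential family is flat. The Geomstats library (Python) includes an information-geometry module with Fisher-Rao metrics for several parametric families.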
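As a concrete instance of the natural gradient, the sketch below fits a Bernoulli distribution in its natural (logit) parameter theta by preconditioning the gradient of the average negative log-likelihood with the inverse Fisher information. The one-parameter setting, the step size, and the function names are assumptions made to keep the example small; plain gradient descent with the same step size is included only for comparison.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit(data_mean, natural, theta=0.0, lr=0.5, steps=100):
    """Minimize the average Bernoulli NLL in the natural parameter theta."""
    for _ in range(steps):
        mu = sigmoid(theta)
        grad = mu - data_mean          # d/dtheta of the average NLL
        if natural:
            fisher = mu * (1.0 - mu)   # Fisher information in theta
            grad = grad / fisher       # natural gradient: F^{-1} * grad
        theta -= lr * grad
    return theta

data_mean = 0.9  # hypothetical fraction of ones in the data

theta_natural = fit(data_mean, natural=True)
theta_plain = fit(data_mean, natural=False)

print("natural gradient: mu = %.10f" % sigmoid(theta_natural))
print("plain gradient:   mu = %.10f" % sigmoid(theta_plain))
```

For an exponential family in natural coordinates the Fisher information equals the Hessian of the negative log-likelihood, so the natural-gradient step here coincides with a damped Newton step.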

Which alpha-connection does Amari's natural gradient correspond to?

Summary

  • Amari's alpha-divergence: $D^{(\alpha)}(P \| Q) = \frac{4}{1-\alpha^2}\left(1 - \int p(x)^{(1+\alpha)/2} q(x)^{(1-\alpha)/2} dx\right)$
  • As $\alpha \to 1$: $D^{(1)} = \mathrm{KL}(P \| Q)$; as $\alpha \to -1$: $D^{(-1)} = \mathrm{KL}(Q \| P)$
  • Alpha = +1: zero-avoiding (mode-covering), q covers all modes of p; alpha = -1: zero-forcing (mode-seeking), q concentrates on a single mode
  • Renyi divergence of order alpha is related to Amari's alpha-divergence via a monotone transformation
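The monotone transformation between the two families can be written out explicitly. The derivation below is a sketch that follows from the two definitions given in this lesson; the symbol $C_\alpha$ for the shared overlap integral and the Renyi order $a$ are notation introduced here for convenience.

```latex
% Both divergences are functions of the same overlap integral.
\begin{align*}
C_\alpha(P,Q) &= \int p(x)^{\frac{1+\alpha}{2}}\, q(x)^{\frac{1-\alpha}{2}}\,dx
              = 1 - \frac{1-\alpha^2}{4}\, D^{(\alpha)}(P \| Q), \\
R_a(P \| Q)   &= \frac{1}{a-1}\,\log C_\alpha(P,Q)
              = \frac{2}{\alpha-1}\,\log\!\left(1 - \frac{1-\alpha^2}{4}\, D^{(\alpha)}(P \| Q)\right),
\qquad a = \frac{1+\alpha}{2}.
\end{align*}
% Example: alpha = 0 gives a = 1/2 and
% R_{1/2}(P \| Q) = -2 \log\left(1 - D^{(0)}(P \| Q)/4\right) = -2 \log \int \sqrt{p q}\,dx.
```

So Amari's alpha corresponds to the Renyi order $(1+\alpha)/2$, and for a fixed alpha with $\alpha > -1$ and $\alpha \neq 1$ the Renyi divergence is an increasing function of $D^{(\alpha)}$.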

Related topics

Alpha-divergences unify geometry and statistics:

  • KL divergence and Bregman — KL is a boundary case of alpha-divergence
  • e/m-projections — Alpha-projections generalize e- and m-projections
  • Generative models — Choice of divergence determines generation quality

Questions for reflection

  • VAEs produce blurry images. How would quality change if you moved from KL(q||p) to KL(p||q)? Why is this not done directly in practice?
  • Renyi-alpha divergence is used for differential privacy analysis. Why specifically Renyi rather than KL?
  • If alpha-divergences with different alpha give different trade-offs between mode coverage, which alpha would you pick for a real image-generation task?

Related lessons

  • ig-15-stat-manifold-advanced — e/m-geodesics and the Pythagorean theorem
  • ig-04-kl-bregman — KL is a special case of alpha-divergence
  • ig-13-generative — GANs use different divergences