Information Geometry

Exponential families

Gaussian, Bernoulli, Poisson, Gamma - they look completely different. But all share one formula: p(x|eta) = h(x) exp(eta*T(x) - A(eta)). This is not a coincidence - it is the definition of the exponential family. This form is why the KL term in a VAE has a closed form, why logistic regression is a convex problem, and why Adam can be read as a rough diagonal relative of natural gradient.

**VAE / variational inference**: ELBO contains KL(q(z|x) || p(z)). For q = N(mu, sigma^2) and p = N(0, 1) this is analytic: KL = (mu^2 + sigma^2 - log(sigma^2) - 1)/2. It works because both are members of the same exponential family, where KL is a Bregman divergence generated by the log-partition A(eta). Without this structure the KL term would generally need a Monte Carlo estimate on every training step.
**Natural gradient (Amari)**: in natural coordinates the Fisher matrix of an exp-family is A''(eta) = Var[T(x)]. Natural step: eta -= alpha * A''(eta)^{-1} * grad. Adam's per-coordinate scaling by the square root of a running average of squared gradients is often interpreted as a rough diagonal approximation of this preconditioning, though it is not the same update. K-FAC builds a block-wise, Kronecker-factored approximation of the Fisher matrix. Both aim to account for geometry that plain SGD ignores.
**GLM (generalized linear models)**: logistic regression, Poisson regression, gamma regression - all are special cases of one architecture. Natural parameter eta = w^T x; the canonical link function maps the mean to eta. With the canonical link, the negative log-likelihood is convex in w.
**Thompson sampling / bandits**: conjugate pairs (Beta-Bernoulli, Gamma-Poisson) update the posterior by adding observed counts to the prior hyperparameters. The same structure is why `torch.distributions.kl_divergence` has closed forms for many pairs from the same exponential family.

Prerequisites

Fisher information metric: the only reasonable metric on the space of distributions

Canonical form: one formula for all distributions

The Gaussian looks like exp(-(x-mu)^2/(2 sigma^2)). Bernoulli looks like p^x (1-p)^(1-x). Poisson looks like lambda^x exp(-lambda)/x!. Three formulas from three different textbooks. But on second look, they are one.
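As a quick check of that claim, the sketch below evaluates Bernoulli and Poisson through one shared function of the form h(x) exp(eta*T(x) - A(eta)) and compares the result with the textbook formulas. The function names are illustrative, not taken from any library, and only the scalar-eta case is handled.

```python
import math

def canonical_density(x, eta, T, A, h):
    """Evaluate h(x) * exp(eta * T(x) - A(eta)) for a scalar natural parameter."""
    return h(x) * math.exp(eta * T(x) - A(eta))

def bernoulli_canonical(x, p):
    eta = math.log(p / (1.0 - p))                 # natural parameter (log-odds)
    return canonical_density(
        x, eta,
        T=lambda x: x,                             # sufficient statistic
        A=lambda eta: math.log1p(math.exp(eta)),   # log-partition
        h=lambda x: 1.0,                           # base measure
    )

def poisson_canonical(x, lam):
    eta = math.log(lam)                            # natural parameter (log-rate)
    return canonical_density(
        x, eta,
        T=lambda x: x,
        A=lambda eta: math.exp(eta),
        h=lambda x: 1.0 / math.factorial(x),
    )

def bernoulli_textbook(x, p):
    return p ** x * (1.0 - p) ** (1 - x)

def poisson_textbook(x, lam):
    return lam ** x * math.exp(-lam) / math.factorial(x)

# Both routes give the same probabilities.
for x in (0, 1):
    assert math.isclose(bernoulli_canonical(x, 0.3), bernoulli_textbook(x, 0.3))
for x in range(6):
    assert math.isclose(poisson_canonical(x, 2.5), poisson_textbook(x, 2.5))
```

Only the four ingredients (eta, T, A, h) differ between the two distributions; the evaluation code is shared.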

The logit in logistic regression is not an invented link function. It is the natural parameter of Bernoulli. Writing `sigmoid(w^T x)` maps from eta-space back to probability space. That is why PyTorch offers `BCEWithLogitsLoss` alongside `BCELoss`: working in eta-space is numerically more stable, because the sigmoid and the logarithm are fused into a single log-sum-exp expression.
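The stability argument can be illustrated in plain Python. The sketch below is not PyTorch's implementation; it only assumes that the Bernoulli negative log-likelihood in eta-space is A(eta) - y*eta with A(eta) = log(1 + exp(eta)), and compares it with the naive route through a probability.

```python
import math

def softplus(eta):
    """Stable log(1 + exp(eta)), the Bernoulli log-partition A(eta)."""
    return max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))

def bce_with_logits(eta, y):
    """Negative Bernoulli log-likelihood computed in eta-space: A(eta) - y * eta."""
    return softplus(eta) - y * eta

def bce_from_probability(eta, y):
    """Naive route: squash to p first, then take logarithms."""
    p = 1.0 / (1.0 + math.exp(-eta))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# For moderate logits the two routes agree.
assert math.isclose(bce_with_logits(0.5, 1), bce_from_probability(0.5, 1))

# For a confident wrong prediction, p rounds to exactly 1.0 and log(1 - p) fails.
try:
    bce_from_probability(40.0, 0)
    naive_failed = False
except ValueError:
    naive_failed = True

stable_loss = bce_with_logits(40.0, 0)   # finite, close to 40
```

The eta-space version never forms p, so there is no rounding to 0 or 1 before the logarithm is taken.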

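**Why natural parameters eta matter**: in eta-space the parameter domain is convex and the log-likelihood is concave in eta. Every local maximum is therefore global, and when the family is minimal (strictly concave log-likelihood) the MLE is unique whenever it exists. It can fail to exist at the boundary, for example logistic regression on perfectly separable data. That is why GLMs with canonical links - logistic regression, Poisson regression, gamma regression - avoid the local optima that plague general non-convex problems.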
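A small sketch of what concavity buys in practice: gradient ascent on the Poisson log-likelihood in eta = log(lambda) reaches the same point from several starting values, and that point satisfies A'(eta_hat) = mean(T(x_i)). The data, learning rate and step count are arbitrary choices made for illustration.

```python
import math

def poisson_mle_eta(xs, eta0=0.0, lr=0.1, steps=500):
    """Gradient ascent on the average Poisson log-likelihood in eta = log(lambda)."""
    t_bar = sum(xs) / len(xs)            # mean of the sufficient statistic T(x) = x
    eta = eta0
    for _ in range(steps):
        grad = t_bar - math.exp(eta)     # d/d eta of [eta * t_bar - A(eta)], A = exp
        eta += lr * grad
    return eta

counts = [2, 0, 3, 1, 4, 2, 2]           # sample mean is 2
estimates = [poisson_mle_eta(counts, eta0=start) for start in (-3.0, 0.0, 3.0)]

# Every start converges to the moment-matching solution eta_hat = log(mean).
for eta_hat in estimates:
    assert abs(eta_hat - math.log(2.0)) < 1e-9
```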

Gaussian distribution in canonical form (full derivation)

N(mu, sigma^2) - two natural parameters

p(x; mu, sigma^2) = (1/sqrt(2 pi sigma^2)) exp(-(x-mu)^2 / (2 sigma^2))

Expand the exponent:
-(x-mu)^2 / (2 sigma^2) = -x^2/(2 sigma^2) + mu*x/sigma^2 - mu^2/(2 sigma^2)

This is eta^T T(x) - A(eta), once the normalizing factor 1/sqrt(2 pi sigma^2) is absorbed into A(eta), with:
eta1 = mu/sigma^2 (natural parameter paired with x)
eta2 = -1/(2 sigma^2) (natural parameter paired with x^2)
T(x) = [x, x^2] (sufficient statistic vector)
A(eta) = -eta1^2/(4*eta2) + (1/2) log(-pi/eta2)
h(x) = 1

Back to (mu, sigma^2) from eta:
sigma^2 = -1/(2*eta2)
mu = -eta1/(2*eta2)

VAE connection: the encoder typically predicts [mu, log_sigma]. This is a mean/scale parameterization of the Gaussian posterior q(z|x), not the natural one; the maps above convert it to eta.
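The derivation can be checked numerically. The sketch below uses illustrative function names: it converts (mu, sigma^2) to (eta1, eta2) and back, then confirms that exp(eta^T T(x) - A(eta)) reproduces the textbook density.

```python
import math

def to_natural(mu, sigma2):
    """(mu, sigma^2) -> (eta1, eta2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_mean_variance(eta1, eta2):
    """(eta1, eta2) -> (mu, sigma^2)."""
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)

def log_partition(eta1, eta2):
    """A(eta) = -eta1^2 / (4 eta2) + (1/2) log(-pi / eta2)."""
    return -eta1 ** 2 / (4.0 * eta2) + 0.5 * math.log(-math.pi / eta2)

def density_canonical(x, eta1, eta2):
    """h(x) exp(eta . T(x) - A(eta)) with T(x) = [x, x^2] and h(x) = 1."""
    return math.exp(eta1 * x + eta2 * x * x - log_partition(eta1, eta2))

def density_textbook(x, mu, sigma2):
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

mu, sigma2 = 1.5, 0.8
eta1, eta2 = to_natural(mu, sigma2)

# Round trip recovers the original parameters.
mu_back, sigma2_back = to_mean_variance(eta1, eta2)
assert math.isclose(mu_back, mu) and math.isclose(sigma2_back, sigma2)

# Both forms of the density agree pointwise.
for x in (-1.0, 0.3, 2.0):
    assert math.isclose(density_canonical(x, eta1, eta2), density_textbook(x, mu, sigma2))
```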

The natural parameter of Bernoulli(p) is eta = log(p/(1-p)). What is the name of the inverse map: p = f(eta)?

A(eta): the function that knows everything about the distribution

The log-partition function A(eta) is the logarithm of the normalization constant. A boring name for an object with a remarkable property: its derivatives generate the cumulants of T(x), starting with the mean and the variance.

**A''(eta) = Var[T(x)]** is not a coincidence: in natural coordinates the Fisher information matrix of an exponential family equals the Hessian of A(eta). The Fisher matrix is therefore available analytically from A, with no sampling needed to estimate it. Moreover, the natural gradient with respect to eta equals the ordinary gradient with respect to the mean parameters m = A'(eta), so in some settings the explicit inversion of A''(eta) can be avoided. This is part of what keeps natural-gradient variational inference with exponential-family posteriors tractable.
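To see the derivative identities at work, the sketch below differentiates the Bernoulli log-partition A(eta) = log(1 + exp(eta)) by central finite differences and compares the results with the mean p and the variance p(1-p). The step size and tolerances are arbitrary illustrative choices.

```python
import math

def A(eta):
    """Bernoulli log-partition."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, h=1e-3):
    return (f(x + h) - f(x - h)) / (2.0 * h)

def second_derivative(f, x, h=1e-3):
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

for p in (0.1, 0.3, 0.5, 0.9):
    eta = math.log(p / (1.0 - p))
    # A'(eta) matches E[T(x)] = p
    assert abs(first_derivative(A, eta) - p) < 1e-6
    # A''(eta) matches Var[T(x)] = p (1 - p), the Fisher information in eta
    assert abs(second_derivative(A, eta) - p * (1.0 - p)) < 1e-6
```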

**ELBO in VAE and exponential families**: when the encoder predicts q(z|x) = N(mu, sigma^2), KL(q||p) is computed in closed form: KL(N(mu, sigma^2) || N(0,1)) = (mu^2 + sigma^2 - log(sigma^2) - 1)/2. This works because both are Gaussians (same family) and KL equals the Bregman divergence generated by A(eta). If q(z|x) and p(z) did not share such a family, the KL term would generally need its own Monte Carlo estimate on every training step, on top of the sampling already used for the reconstruction term.
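The Bregman claim can be verified directly. The sketch below assumes the standard identity KL(q || p) = A(eta_p) - A(eta_q) - <eta_p - eta_q, grad A(eta_q)> for two members of one family, uses the Gaussian A(eta) from the canonical-form derivation, and compares the result with the closed-form VAE expression. Function names are illustrative.

```python
import math

def natural_params(mu, sigma2):
    return (mu / sigma2, -1.0 / (2.0 * sigma2))

def A(eta):
    """Gaussian log-partition in natural coordinates."""
    e1, e2 = eta
    return -e1 ** 2 / (4.0 * e2) + 0.5 * math.log(-math.pi / e2)

def grad_A(eta):
    """Gradient of A equals E[T(x)] = (mean, second moment)."""
    e1, e2 = eta
    mu, sigma2 = -e1 / (2.0 * e2), -1.0 / (2.0 * e2)
    return (mu, mu * mu + sigma2)

def kl_bregman(eta_q, eta_p):
    """KL(q || p) as the Bregman divergence generated by A."""
    g = grad_A(eta_q)
    inner = sum((ep - eq) * gi for ep, eq, gi in zip(eta_p, eta_q, g))
    return A(eta_p) - A(eta_q) - inner

def kl_closed_form(mu, sigma2):
    """KL(N(mu, sigma^2) || N(0, 1))."""
    return 0.5 * (mu * mu + sigma2 - math.log(sigma2) - 1.0)

mu, sigma2 = 0.7, 0.4
kl_a = kl_bregman(natural_params(mu, sigma2), natural_params(0.0, 1.0))
kl_b = kl_closed_form(mu, sigma2)
assert math.isclose(kl_a, kl_b, rel_tol=1e-9)
```

Note the argument order: the gradient is taken at the first argument of the KL, which is the second argument of the Bregman divergence.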

Natural gradient in exponential families

Why Amari natural gradient is just A''(eta)^{-1} grad

Standard gradient descent:
eta <- eta - alpha * grad_eta L
Problem: a step in eta-space ignores the curvature of the distribution manifold.

Natural gradient (Amari 1998):
eta <- eta - alpha * I(eta)^{-1} * grad_eta L

For exponential families:
I(eta) = A''(eta) = Var[T(x)] (Fisher = Hessian of A)

So the natural step becomes:
eta <- eta - alpha * Var[T(x)]^{-1} * grad_eta L

Adam is related but not identical. Its update is
theta <- theta - alpha * m / (sqrt(v) + eps)
where m is a running average of gradients and v ≈ E[g^2] is a running average of squared gradients. v resembles the diagonal of the empirical Fisher, but Adam divides by sqrt(v) rather than v, so it is at best a rough diagonal relative of the natural step.

K-FAC (Martens and Grosse 2015) builds a block-diagonal, Kronecker-factored approximation of the Fisher matrix of a network's weights, for neural networks whose outputs define an exp-family likelihood (softmax, sigmoid layers).
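A one-parameter sketch of the difference: fitting a Bernoulli in eta-space with a plain gradient step versus a step divided by the Fisher information A''(eta) = p(1-p). The target mean, step size and step count are arbitrary. In this scalar case with alpha = 1 the natural step coincides with Newton's method, because the Hessian of the loss in eta is exactly A''(eta).

```python
import math

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fit_bernoulli(y_bar, steps, natural, alpha=1.0):
    """Minimize A(eta) - y_bar * eta starting from eta = 0."""
    eta = 0.0
    for _ in range(steps):
        p = sigmoid(eta)
        grad = p - y_bar                 # gradient of the loss in eta
        if natural:
            grad /= p * (1.0 - p)        # precondition by Fisher = A''(eta) = Var[x]
        eta -= alpha * grad
    return eta

target = math.log(0.9 / 0.1)             # exact solution for y_bar = 0.9
eta_plain = fit_bernoulli(0.9, steps=10, natural=False)
eta_natural = fit_bernoulli(0.9, steps=10, natural=True)

# After the same number of steps the preconditioned iterate is far closer.
assert abs(eta_natural - target) < 1e-8
assert abs(eta_plain - target) > 0.1
```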

For Poisson, dA/d(eta) = exp(eta) = lambda. What does the second derivative d^2A/d(eta)^2 give?

Conjugate priors: why beta-binomial and gamma-Poisson close analytically

Bayesian updating in general requires numerical integration. Given prior p(theta), multiply by likelihood p(x|theta), normalize, and the posterior is recovered. For arbitrary prior and likelihood the normalization has no closed form. But for exponential families there exists a class of priors where everything closes into a formula.

The Beta distribution is itself an exponential family distribution over parameter p. A conjugate prior is "self-consistent" in shape: prior and posterior belong to the same class. This is not a mathematical accident - it follows from the log-likelihood being linear in T(x), so each observation simply adds its T(x) to the prior's hyperparameters. That linearity is the structural property of exponential families.

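Probabilistic programming frameworks (Pyro, NumPyro, Stan, Edward2, Turing.jl) are mostly built around general-purpose inference: gradient-based variational inference and MCMC, neither of which needs conjugacy. Conjugacy still helps where it is present. Specifying `q(z) = Normal(mu, sigma)` against a Gaussian prior makes the KL term of the ELBO analytic, and a conjugate pair inside a larger model can be integrated out or updated analytically instead of sampled. Whether this is detected automatically or left to the user varies by framework and version. The technique is usually called collapsed inference or Rao-Blackwellization.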

Likelihood	Conjugate prior	Parameter	ML application
Bernoulli	Beta(alpha, beta)	p - success probability	Thompson sampling in bandits
Poisson	Gamma(a, b)	lambda - event rate	Modeling clicks, events, counts
Gaussian (mu, known sigma^2)	Gaussian(mu0, tau)	mu - mean	Gaussian process regression
Categorical	Dirichlet(alpha)	p - probability vector	LDA, topic models, prior over class probabilities
Gaussian (sigma^2, known mu)	Inverse-Gamma(a, b)	sigma^2 - variance	Bayesian linear regression
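The first row of the table can be sketched in a few lines. The simulation below is a minimal Thompson sampling loop for Bernoulli arms with Beta(1, 1) priors; the arm probabilities, round count and seed are made up for illustration. The point is the posterior update, which is a single addition per observation.

```python
import random

def thompson_sampling(true_probs, rounds, seed=0):
    rng = random.Random(seed)
    # One [alpha, beta] pair per arm, starting from the uniform Beta(1, 1) prior.
    posterior = [[1, 1] for _ in true_probs]
    pulls = [0] * len(true_probs)
    for _ in range(rounds):
        # Draw one plausible success rate per arm from its current posterior.
        samples = [rng.betavariate(a, b) for a, b in posterior]
        arm = max(range(len(samples)), key=samples.__getitem__)
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate update: a success adds 1 to alpha, a failure adds 1 to beta.
        posterior[arm][0] += reward
        posterior[arm][1] += 1 - reward
        pulls[arm] += 1
    return posterior, pulls

posterior, pulls = thompson_sampling([0.2, 0.5, 0.8], rounds=2000)

# Each arm's hyperparameters are just the prior plus its observed counts.
for (a, b), n in zip(posterior, pulls):
    assert a + b - 2 == n
```

No integration or normalization appears anywhere in the loop, which is what conjugacy provides.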

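**PyTorch distributions and ExponentialFamily**: `torch.distributions.Bernoulli`, `Poisson`, `Normal`, `Gamma` all inherit from `ExponentialFamily`. Each subclass exposes `_natural_params` and `_log_normalizer` (the same object as A(eta)); the base class uses them, together with `_mean_carrier_measure`, to compute `entropy()` generically, while `log_prob` is implemented per distribution. `torch.distributions.kl_divergence` has closed-form implementations for many pairs from the same family, plus a generic exponential-family fallback, precisely because KL equals the Bregman divergence generated by A(eta).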

Misconception: conjugate priors are a convenient mathematical trick with no deeper meaning.

In fact: conjugate priors are a consequence of exponential family structure. A prior of the form exp(eta*chi - nu*A(eta)), when multiplied by the likelihood exp(eta*T(x) - A(eta)), yields a posterior of the same form.

Why it matters: understanding conjugacy through exp-families explains why it works and allows one to derive conjugate priors systematically rather than memorizing a lookup table.
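The generic update can be written once and reused. The sketch below assumes the prior exp(eta*chi - nu*A(eta)) is read as a density over eta; under that assumption (chi, nu) corresponds to Beta(chi, nu - chi) for a Bernoulli likelihood and to Gamma(shape=chi, rate=nu) for a Poisson likelihood. The helper names and the data are hypothetical.

```python
def conjugate_update(chi, nu, stats):
    """Posterior hyperparameters: chi += sum of T(x_i), nu += n."""
    return chi + sum(stats), nu + len(stats)

def beta_params(chi, nu):
    """Bernoulli likelihood: (chi, nu) corresponds to Beta(alpha=chi, beta=nu - chi)."""
    return chi, nu - chi

def gamma_params(chi, nu):
    """Poisson likelihood: (chi, nu) corresponds to Gamma(shape=chi, rate=nu)."""
    return chi, nu

# Beta(2, 3) prior means chi = 2, nu = 5; observe 7 successes in 12 flips.
flips = [1] * 7 + [0] * 5
chi_b, nu_b = conjugate_update(2, 5, flips)
assert beta_params(chi_b, nu_b) == (9, 8)      # Beta(2 + 7, 3 + 5)

# Gamma(shape=3, rate=2) prior means chi = 3, nu = 2; observe counts 4, 1, 3.
chi_g, nu_g = conjugate_update(3, 2, [4, 1, 3])
assert gamma_params(chi_g, nu_g) == (11, 5)    # Gamma(3 + 8, 2 + 3)
```

The same two-line update serves both families; only the translation back to the familiar parameter names differs.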

Thompson sampling for a multi-armed bandit with Bernoulli rewards uses a Beta prior. After observing 10 rewards in 30 trials for arm k, the posterior is Beta(alpha + 10, beta + 20). What does this represent in terms of exponential family structure?

Key takeaways

**Canonical form p(x|eta) = h(x) exp(eta*T(x) - A(eta))**: eta is the natural parameter, T(x) the sufficient statistic, A(eta) the log-partition. Gaussian, Bernoulli, Poisson, Dirichlet are all special cases
**A'(eta) = E[T(x)], A''(eta) = Var[T(x)]**: the log-partition generates the cumulants of T(x). The first derivative is the mean, the second is the variance. Fisher matrix in natural coordinates = A''(eta). Natural gradient = A''(eta)^{-1} grad
**Conjugate priors**: a prior of the form exp(eta*chi - nu*A(eta)) updates analytically - posterior has the same form with chi += sum(T(x_i)), nu += n. Beta-Bernoulli, Gamma-Poisson, Dirichlet-Categorical - all one structure
**Why this matters for ML**: VAE - analytic KL term, GLM - convexity under the canonical link, Adam/K-FAC - rough approximations of Fisher preconditioning, Thompson sampling - posterior in one line. PyTorch `ExponentialFamily` is not an incidental class - it encodes exactly this structure

Where to go next

Exponential families are the bridge between statistics and geometry. The next step is divergences and the dually flat structure they generate.

KL and Bregman divergences — KL between two exp-family members is a Bregman divergence generated by A(eta)
Dually flat manifolds — Exp-families are the prime example of dually flat structure (e-flat and m-flat coordinates)
Natural gradient — Fisher = A''(eta) for exp-families makes natural gradient analytic and tractable
MLE and sufficient statistics — MLE for exp-family is a closed equation via E[T(x)]: A'(eta_hat) = mean(T(x_i))

Questions for reflection

PyTorch `torch.distributions` has an `ExponentialFamily` class with methods `_natural_params`, `_log_normalizer`, `_mean_carrier_measure`. Knowing that A'(eta) = E[T(x)] and A''(eta) = Fisher, how should the `entropy()` method work for any exponential family - without knowing the specific distribution?
VAE uses a Gaussian posterior q(z|x) = N(mu, sigma^2). If one replaced it with a Laplace distribution (an exponential family only when its location is held fixed), would the KL term of the ELBO keep a closed form? What changes?
Natural gradient descent for logistic regression uses the Fisher information in eta-coordinates, A''(eta) = p(1-p). As p approaches 0 or 1 this factor approaches 0, so dividing by it makes the natural gradient step grow large. Is this a bug or a feature? How does this connect to the Fisher information in the p-parameterization, 1/(p(1-p)), which grows at the boundaries?

Related lessons

ig-02-fisher-metric — Fisher metric of exp-family has special simplicity via A''(eta)
ig-07-natural-gradient — Natural gradient in exp-family is computed analytically through A''(eta)
ig-04-kl-bregman — KL between exp-family members is a Bregman divergence via A(eta)
stat-03-mle — MLE in exp-family reduces to moment matching on sufficient statistics
stat-11-bayesian — Conjugate priors are the exp-family structure that gives closed Bayes updates
prob-09-discrete-dist — Gaussian, Bernoulli, Poisson - instances of one formula