Information Geometry

Exponential Families

Normal, Poisson, binomial, gamma, and beta distributions all share one algebraic structure. That structure explains why maximum likelihood reduces to moment matching, and why EM M-steps and variational inference updates take the same form for all of them, not by coincidence but by geometry.

VAE (Kingma and Welling, 2013): a normal posterior q(z|x) has natural parameters eta_1 = mu/sigma^2, eta_2 = -1/(2 sigma^2); with a diagonal covariance, the KL regularizer against a standard normal prior reduces to an analytic O(d) formula
GLMs: logistic and Poisson regression are exponential family models; the link function and IRLS algorithm follow automatically from the family structure
EM algorithm: the M-step in exponential families is moment matching, nabla A(eta) = expected sufficient statistics; it is an analytic update whenever nabla A can be inverted in closed form
Conjugate Bayesian inference: exponential family likelihoods with conjugate priors give analytic posterior updates by simply adding natural parameters

Prerequisites

Probability distributions and moments
Fisher information matrix
Legendre-Fenchel transform

Exponential Families and the Log-Partition Function

An exponential family is defined by p(x|eta) = h(x) exp(eta^T T(x) - A(eta)), where T(x) is the sufficient statistic, eta is the natural parameter, and A(eta) = log integral h(x) exp(eta^T T(x)) dx is the log-partition function (the integral becomes a sum for discrete x). Most classical distributions belong to this class: one only needs to check that the log-density splits into a term depending on x alone, a term depending on the parameters alone, and an inner product between a function of the parameters and a function of x.

Key examples with natural parameters and sufficient statistics: Gaussian N(mu, sigma^2): eta = (mu/sigma^2, -1/(2 sigma^2)), T(x) = (x, x^2). Poisson Poi(lambda): eta = log lambda, T(x) = x. Bernoulli Ber(p): eta = log(p/(1-p)) (log-odds), T(x) = x. Each family's complete geometry is encoded in A(eta).
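As a minimal sketch of the Gaussian example, the snippet below converts between (mu, sigma^2) and the natural parameters and evaluates the density in exponential family form. The function names are illustrative, and the choice h(x) = 1/sqrt(2 pi) for the base measure is an assumption consistent with T(x) = (x, x^2).

```python
import numpy as np

def gaussian_to_natural(mu, sigma2):
    """Map (mu, sigma^2) to natural parameters (eta1, eta2)."""
    return np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])

def natural_to_gaussian(eta):
    """Invert the map above; valid only for eta2 < 0."""
    eta1, eta2 = eta
    sigma2 = -1.0 / (2.0 * eta2)
    return eta1 * sigma2, sigma2

def gaussian_log_partition(eta):
    """A(eta) = -eta1^2 / (4 eta2) - 0.5 log(-2 eta2), assuming h(x) = 1/sqrt(2 pi)."""
    eta1, eta2 = eta
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

def gaussian_log_density(x, eta):
    """log p(x|eta) = log h(x) + eta^T T(x) - A(eta), with T(x) = (x, x^2)."""
    log_h = -0.5 * np.log(2.0 * np.pi)
    return log_h + eta[0] * x + eta[1] * x**2 - gaussian_log_partition(eta)

eta = gaussian_to_natural(1.5, 0.7)
print(eta, natural_to_gaussian(eta), gaussian_log_density(0.3, eta))
```

The value returned by gaussian_log_density should agree with the usual formula -0.5 log(2 pi sigma^2) - (x - mu)^2 / (2 sigma^2).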

Exponential families are the 'flat' submanifolds of the statistical manifold in the exponential connection: in natural coordinates eta, geodesics are straight lines and the curvature tensor vanishes. This shared structure is why MLE, EM M-steps, and conjugate Bayesian updates take the same moment-matching or parameter-addition form across all of them.

What does the Hessian of the log-partition function A(eta) equal in an exponential family?

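In exponential families Hess(A(eta)) = Cov(T(X)) = F(eta). This identity links geometry (Fisher metric), statistics (covariance), and analysis (Hessian of the normalizer) in a single structure. It implies that A is convex; for a minimal family, where no nontrivial linear combination of the components of T(X) is almost surely constant, A is strictly convex and the map eta -> mu = nabla A(eta) is injective.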
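The identity can be checked numerically in one dimension. The sketch below uses central finite differences on A(eta) for the Bernoulli and Poisson families and compares the results with the known mean and variance; the helper names and the step size are illustrative choices, not part of the original text.

```python
import numpy as np

def log_partition_bernoulli(eta):
    # A(eta) = log(1 + e^eta), computed stably
    return np.logaddexp(0.0, eta)

def log_partition_poisson(eta):
    # A(eta) = e^eta = lambda
    return np.exp(eta)

def finite_diff_derivatives(A, eta, step=1e-4):
    """Central-difference estimates of A'(eta) and A''(eta)."""
    first = (A(eta + step) - A(eta - step)) / (2.0 * step)
    second = (A(eta + step) - 2.0 * A(eta) + A(eta - step)) / step**2
    return first, second

# Bernoulli(p): expect A' = p and A'' = p (1 - p)
p = 0.3
mean_b, var_b = finite_diff_derivatives(log_partition_bernoulli, np.log(p / (1 - p)))

# Poisson(lam): expect A' = A'' = lam
lam = 2.5
mean_p, var_p = finite_diff_derivatives(log_partition_poisson, np.log(lam))

print(mean_b, var_b, mean_p, var_p)
```

In the scalar case the Hessian is just A''(eta), so the second output of each call is both the variance of T(X) = X and the Fisher information in the natural parameter.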

Legendre Transform and Dual Geometry

The Legendre transform A*(mu) = sup_eta(eta^T mu - A(eta)) defines the convex conjugate of A. The gradient map mu = nabla A(eta), with inverse eta = nabla A*(mu), establishes a bijection between natural parameters eta and mean parameters mu. Geometrically, eta and mu are two coordinate systems on the same statistical manifold, related by the convex duality of A and A*. The dual flat structure is what allows exact Pythagorean theorems, closed-form projections, and analytic EM updates.

In variational autoencoders the KL term between q(z|x) = N(mu, Sigma) and p(z) = N(0, I) in d dimensions is: KL = (1/2)(mu^T mu + Tr(Sigma) - log det Sigma - d). This closed form exists precisely because the Gaussian is an exponential family and KL is a Bregman divergence expressible via the log-partition function A(eta).
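A small sketch of both claims, for a diagonal covariance: the first function evaluates the closed form in O(d), and the remaining ones check in one dimension that the same number comes out of the Bregman divergence of A, using KL(q || p) = A(eta_p) - A(eta_q) - nabla A(eta_q)^T (eta_p - eta_q). Function names and the test values mu = 0.8, variance 0.5 are illustrative.

```python
import numpy as np

def kl_diag_gaussian_to_standard(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) in O(d)."""
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    return 0.5 * np.sum(mu**2 + var - np.log(var) - 1.0)

def log_partition(eta):
    """Univariate Gaussian A(eta), assuming h(x) = 1/sqrt(2 pi)."""
    eta1, eta2 = eta
    return -eta1**2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)

def mean_parameters(eta):
    """nabla A(eta) = (E[x], E[x^2])."""
    eta1, eta2 = eta
    mean = -eta1 / (2.0 * eta2)
    var = -1.0 / (2.0 * eta2)
    return np.array([mean, mean**2 + var])

def bregman_kl(eta_q, eta_p):
    """KL(q || p) as the Bregman divergence of A with swapped arguments."""
    return (log_partition(eta_p) - log_partition(eta_q)
            - mean_parameters(eta_q) @ (eta_p - eta_q))

mu, var = 0.8, 0.5
eta_q = np.array([mu / var, -1.0 / (2.0 * var)])  # q = N(mu, var)
eta_p = np.array([0.0, -0.5])                     # p = N(0, 1)

closed_form = kl_diag_gaussian_to_standard([mu], [var])
via_bregman = bregman_kl(eta_q, eta_p)
print(closed_form, via_bregman)
```

The two printed values should agree up to floating-point error.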

What does the Legendre conjugate A*(mu) equal for an exponential family?

A*(mu) = sup_eta(eta^T mu - A(eta)) = eta*(mu)^T mu - A(eta*(mu)), where eta*(mu) solves nabla A(eta) = mu. Substituting the definition H(p) = -integral p log p gives A*(mu) = -H(p_mu) - E[log h(X)], which reduces to A*(mu) = -H(p_mu) when the base measure is h(x) = 1. The connection between convex duality and Shannon entropy is a key insight of information geometry.
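For the Bernoulli family, where h(x) = 1, the identity can be checked directly. The sketch below approximates the supremum by brute force on a dense grid of eta values and compares it with the negative entropy mu log mu + (1 - mu) log(1 - mu); the grid range and resolution are arbitrary illustrative choices.

```python
import numpy as np

# Dense grid of natural parameters used to approximate the supremum
ETA_GRID = np.linspace(-20.0, 20.0, 400001)

def log_partition(eta):
    # Bernoulli: A(eta) = log(1 + e^eta)
    return np.logaddexp(0.0, eta)

def conjugate_on_grid(mu):
    """Approximate A*(mu) = sup_eta (eta * mu - A(eta)) on ETA_GRID."""
    return np.max(ETA_GRID * mu - log_partition(ETA_GRID))

def negative_entropy(mu):
    """-H(Ber(mu)) = mu log mu + (1 - mu) log(1 - mu), for 0 < mu < 1."""
    return mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu)

for mu in (0.1, 0.3, 0.5, 0.9):
    print(mu, conjugate_on_grid(mu), negative_entropy(mu))
```

The maximizer on the grid sits near eta = log(mu / (1 - mu)), the log-odds, which is the inverse of the mean map nabla A.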

Maximum Entropy Principle and Exponential Family Structure

The maximum entropy principle (Jaynes, 1957) states that given constraints on expected values of features T(x), the distribution with largest entropy satisfying those constraints is the unique exponential family distribution with sufficient statistic T(x). This provides a foundational justification for exponential families: they are the 'most unbiased' distributions consistent with the observed statistics.

Conjugate priors for exponential families have the form p(eta | chi, nu) = h(eta) exp(chi^T eta - nu A(eta) - B(chi, nu)). After observing n data points, the posterior updates by simply incrementing chi by sum T(x_i) and nu by n. This is Bayesian conjugacy as parameter addition in natural coordinates.
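As an illustration, for the Bernoulli likelihood the prior above corresponds to a Beta(alpha, beta) distribution on p with chi = alpha and nu = alpha + beta (this correspondence includes the Jacobian of the change of variables from eta to p). The sketch below applies the generic update and converts back; the helper names and the data are hypothetical.

```python
import numpy as np

def conjugate_update(chi, nu, sufficient_stats):
    """Posterior hyperparameters: chi + sum_i T(x_i) and nu + n."""
    stats = np.asarray(sufficient_stats, dtype=float)
    return chi + stats.sum(axis=0), nu + stats.shape[0]

def beta_from_hyperparameters(chi, nu):
    """Bernoulli case: (chi, nu) corresponds to Beta(alpha=chi, beta=nu-chi)."""
    return chi, nu - chi

# Prior Beta(2, 3) expressed as (chi, nu)
alpha0, beta0 = 2.0, 3.0
chi0, nu0 = alpha0, alpha0 + beta0

# Six Bernoulli observations; T(x) = x
data = np.array([1, 0, 1, 1, 0, 1])
chi_n, nu_n = conjugate_update(chi0, nu0, data)
alpha_n, beta_n = beta_from_hyperparameters(chi_n, nu_n)
print(alpha_n, beta_n)
```

The result matches the familiar rule of adding the number of successes to alpha and the number of failures to beta.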

According to the maximum entropy principle, which distribution maximizes entropy subject to E[T(x)] = mu?

The constrained MaxEnt optimization has the exponential family p*(x) = h(x) exp(eta^T T(x) - A(eta)) as its unique solution, with Lagrange multipliers becoming natural parameters. This provides a foundational justification: exponential families are maximum-entropy distributions for given moment constraints, not ad hoc constructions.
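The Lagrangian argument can be sketched as follows. The multiplier lambda for the normalization constraint is notation introduced here, and measuring entropy relative to the base measure h (which reduces to Shannon entropy when h(x) = 1) is an assumption made so that h(x) appears in the solution.

```latex
\begin{aligned}
\mathcal{L}[p] &= -\int p(x)\,\log\frac{p(x)}{h(x)}\,dx
  + \eta^\top\!\Big(\int p(x)\,T(x)\,dx - \mu\Big)
  + \lambda\Big(\int p(x)\,dx - 1\Big), \\
\frac{\delta \mathcal{L}}{\delta p(x)} &= -\log\frac{p(x)}{h(x)} - 1 + \eta^\top T(x) + \lambda = 0
  \;\Longrightarrow\;
  p^*(x) = h(x)\,\exp\!\big(\eta^\top T(x) + \lambda - 1\big), \\
\int p^*(x)\,dx = 1 &\;\Longrightarrow\;
  1 - \lambda = \log\int h(x)\,\exp\!\big(\eta^\top T(x)\big)\,dx = A(\eta).
\end{aligned}
```

Substituting the last line into the second gives p*(x) = h(x) exp(eta^T T(x) - A(eta)), with eta fixed by the moment constraint nabla A(eta) = mu.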

Connections to other topics

Exponential families unite statistics, convex analysis, and information geometry through the structure of A(eta).

Convex optimization
Variational inference
Maximum entropy principle

Summary

Form: p(x|eta) = h(x) exp(eta^T T(x) - A(eta)), where eta is the natural parameter and T(x) is the sufficient statistic
nabla A(eta) = E[T(X)] = mu: gradient of the log-partition function gives mean parameters
Hess(A(eta)) = Cov(T(X)) = F(eta): Hessian is both the Fisher matrix and the statistic covariance
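Legendre duality: eta and mu are linked by nabla A and nabla A*; A*(mu) equals the negative entropy when h(x) = 1
KL between two members of an exponential family is the Bregman divergence generated by A(eta), with swapped arguments: KL(p_eta1 || p_eta2) = A(eta2) - A(eta1) - nabla A(eta1)^T (eta2 - eta1)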
