Information Geometry
Exponential Families
Normal, Poisson, binomial, gamma, and beta distributions all share one algebraic structure. That structure explains why maximum likelihood reduces to matching sufficient statistics, why EM M-steps are simple, and why variational inference is tractable for all of them - not by coincidence but by geometry.
- VAE (Kingma and Welling): the Gaussian posterior q(z|x) has natural parameters eta_1 = mu/sigma^2, eta_2 = -1/(2 sigma^2); with a diagonal covariance, the KL regularizer against a standard normal prior reduces to an analytic O(d) formula
- GLMs: logistic and Poisson regression are exponential family models; the canonical link function and the IRLS algorithm follow directly from the family structure
- EM algorithm: the M-step in exponential families reduces to moment matching, nabla A(eta) = expected sufficient statistics; the update is closed-form whenever nabla A can be inverted analytically (as in Gaussian mixtures)
- Conjugate Bayesian inference: exponential family likelihoods with conjugate priors give analytic posterior updates by simply adding natural parameters
Prerequisites
- Probability distributions and moments
- Fisher information matrix
- Legendre-Fenchel transform
Exponential Families and the Log-Partition Function
An exponential family is defined by p(x|eta) = h(x) exp(eta^T T(x) - A(eta)), where T(x) is the sufficient statistic, eta is the natural parameter, h(x) is the base measure, and A(eta) = log integral h(x) exp(eta^T T(x)) dx is the log-partition function. Most classical distributions belong to this class: one only needs to check that the log-density splits into a term depending only on x, a term depending only on the parameters, and an inner product between a function of the parameters and a function of x.
Key examples with natural parameters and sufficient statistics: Gaussian N(mu, sigma^2): eta = (mu/sigma^2, -1/(2 sigma^2)), T(x) = (x, x^2). Poisson Poi(lambda): eta = log lambda, T(x) = x. Bernoulli Ber(p): eta = log(p/(1-p)) (log-odds), T(x) = x. Each family's complete geometry is encoded in A(eta).
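The parameterizations above can be checked numerically. The following is a minimal sketch, not a reference implementation: the function names are illustrative, and the base measures h(x) = 1/sqrt(2 pi) for the Gaussian and h(k) = 1/k! for the Poisson are the standard choices, stated here as assumptions because the text does not fix them.

```python
import math


def gaussian_natural(mu, sigma2):
    """Natural parameters of N(mu, sigma^2) for T(x) = (x, x^2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)


def gaussian_log_partition(eta1, eta2):
    """A(eta) for the Gaussian, assuming base measure h(x) = 1/sqrt(2*pi)."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * math.log(-2.0 * eta2)


def gaussian_pdf_expfam(x, mu, sigma2):
    """Density evaluated as h(x) * exp(eta^T T(x) - A(eta))."""
    eta1, eta2 = gaussian_natural(mu, sigma2)
    h = 1.0 / math.sqrt(2.0 * math.pi)
    return h * math.exp(eta1 * x + eta2 * x * x - gaussian_log_partition(eta1, eta2))


def gaussian_pdf_direct(x, mu, sigma2):
    """Textbook Gaussian density, used only as a cross-check."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def poisson_pmf_expfam(k, lam):
    """Poisson pmf with eta = log(lambda), A(eta) = exp(eta), h(k) = 1/k!."""
    eta = math.log(lam)
    return math.exp(eta * k - math.exp(eta)) / math.factorial(k)


def bernoulli_pmf_expfam(x, p):
    """Bernoulli pmf with eta = log-odds, A(eta) = log(1 + exp(eta)), h = 1."""
    eta = math.log(p / (1.0 - p))
    return math.exp(eta * x - math.log1p(math.exp(eta)))


if __name__ == "__main__":
    print(gaussian_pdf_expfam(0.3, 1.0, 2.0), gaussian_pdf_direct(0.3, 1.0, 2.0))
    print(poisson_pmf_expfam(3, 2.5))
    print(bernoulli_pmf_expfam(1, 0.3))
```

The two Gaussian evaluations should agree to floating-point precision, which is a quick way to confirm that a proposed (eta, T, A, h) decomposition is consistent.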
Exponential families are the 'flat' submanifolds of the statistical manifold with respect to the exponential connection: in natural coordinates eta, the geodesics of that connection are straight lines and its curvature tensor vanishes. This flatness is why moment-matching MLE, simple EM M-steps, and conjugate Bayesian updates are available across the whole class.
What does the Hessian of the log-partition function A(eta) equal in an exponential family?
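In exponential families Hess(A(eta)) = Cov(T(X)) = F(eta), where F(eta) is the Fisher information matrix in natural coordinates. This identity links geometry (Fisher metric), statistics (covariance), and analysis (Hessian of the normalizer) in a single structure. Because a covariance matrix is positive semidefinite, A is convex; when the representation is minimal (no affine dependence among the components of T), the covariance is positive definite, so A is strictly convex and the map eta -> mu = nabla A(eta) is injective.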
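The identity Hess(A) = Cov(T(X)) can be verified numerically for the univariate Gaussian with T(x) = (x, x^2). This sketch assumes the base measure h(x) = 1/sqrt(2 pi) and uses a central finite-difference Hessian; the helper names are hypothetical, and the covariance entries come from the standard Gaussian moment formulas.

```python
import numpy as np


def log_partition(eta):
    """Gaussian A(eta) for T(x) = (x, x^2), assuming h(x) = 1/sqrt(2*pi)."""
    eta1, eta2 = eta
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)


def numerical_hessian(f, x, step=1e-4):
    """Central finite-difference approximation of the Hessian of f at x."""
    x = np.asarray(x, dtype=float)
    n = x.size
    hess = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = step
            ej[j] = step
            hess[i, j] = (
                f(x + ei + ej) - f(x + ei - ej) - f(x - ei + ej) + f(x - ei - ej)
            ) / (4.0 * step ** 2)
    return hess


def covariance_of_T(mu, sigma2):
    """Exact covariance of (X, X^2) for X ~ N(mu, sigma^2)."""
    return np.array(
        [
            [sigma2, 2.0 * mu * sigma2],
            [2.0 * mu * sigma2, 4.0 * mu ** 2 * sigma2 + 2.0 * sigma2 ** 2],
        ]
    )


if __name__ == "__main__":
    mu, sigma2 = 1.0, 2.0
    eta = np.array([mu / sigma2, -1.0 / (2.0 * sigma2)])
    print(numerical_hessian(log_partition, eta))
    print(covariance_of_T(mu, sigma2))
```

The two printed matrices should match up to finite-difference error, and both have positive eigenvalues, which is the strict convexity of A in this minimal representation.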
Legendre Transform and Dual Geometry
The Legendre transform A*(mu) = sup_eta(eta^T mu - A(eta)) establishes a bijection between natural parameters eta and mean parameters mu = nabla A(eta), with inverse eta = nabla A*(mu). Geometrically, eta and mu are two coordinate systems on the same statistical manifold, related by the convex duality of A and A*. The dual flat structure is what allows exact Pythagorean theorems, closed-form projections, and analytic EM updates.
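To make the two coordinate systems concrete, here is a small sketch of the maps between natural and mean parameters for the univariate Gaussian, where mu = (E[X], E[X^2]). The function names are illustrative; the inverse map is written out by hand rather than derived from A*, which is an assumption made for brevity.

```python
import numpy as np


def natural_to_mean(eta):
    """Forward map mu = grad A(eta) for the Gaussian with T(x) = (x, x^2)."""
    eta1, eta2 = eta
    mean = -eta1 / (2.0 * eta2)
    variance = -1.0 / (2.0 * eta2)
    return np.array([mean, mean ** 2 + variance])


def mean_to_natural(m):
    """Inverse map eta = grad A*(mu), valid when m2 - m1^2 > 0."""
    m1, m2 = m
    variance = m2 - m1 ** 2
    if variance <= 0:
        raise ValueError("mean parameters must satisfy E[X^2] > E[X]^2")
    return np.array([m1 / variance, -1.0 / (2.0 * variance)])


if __name__ == "__main__":
    eta = np.array([0.5, -0.25])          # corresponds to N(1, 2)
    m = natural_to_mean(eta)              # (E[X], E[X^2])
    print(m, mean_to_natural(m))          # round trip recovers eta
```

The constraint E[X^2] > E[X]^2 in the inverse map illustrates that the mean parameters live in their own convex domain, distinct from the natural parameter domain eta2 < 0.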
In variational autoencoders the KL term between q(z|x) = N(mu, Sigma) and p(z) = N(0, I) in d dimensions is: KL(q || p) = (1/2)(mu^T mu + Tr(Sigma) - log det Sigma - d). This closed form exists precisely because the Gaussian is an exponential family: the KL divergence between two members of the family is a Bregman divergence of the log-partition function A(eta).
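The closed form and its Bregman interpretation can be compared directly. The sketch below assumes a diagonal covariance (the usual VAE choice, so the KL sums over coordinates) and the Gaussian log-partition function with h(x) = 1/sqrt(2 pi); the function names are hypothetical.

```python
import numpy as np


def kl_to_standard_normal(mu, sigma2):
    """Closed-form KL(N(mu, diag(sigma2)) || N(0, I)), cost O(d)."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    return 0.5 * np.sum(mu ** 2 + sigma2 - np.log(sigma2) - 1.0)


def log_partition(eta1, eta2):
    """Per-coordinate Gaussian A(eta), assuming h(x) = 1/sqrt(2*pi)."""
    return -eta1 ** 2 / (4.0 * eta2) - 0.5 * np.log(-2.0 * eta2)


def kl_via_bregman(mu, sigma2):
    """Same KL as A(eta_p) - A(eta_q) - grad A(eta_q)^T (eta_p - eta_q)."""
    mu = np.asarray(mu, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    # Natural parameters of q = N(mu, sigma2) and of the prior p = N(0, 1).
    q1, q2 = mu / sigma2, -0.5 / sigma2
    p1, p2 = np.zeros_like(mu), np.full_like(mu, -0.5)
    # grad A at eta_q equals the mean parameters (E[X], E[X^2]) under q.
    g1, g2 = mu, mu ** 2 + sigma2
    bregman = log_partition(p1, p2) - log_partition(q1, q2) - g1 * (p1 - q1) - g2 * (p2 - q2)
    return np.sum(bregman)


if __name__ == "__main__":
    mu = [0.5, -1.0, 2.0]
    sigma2 = [0.25, 1.0, 4.0]
    print(kl_to_standard_normal(mu, sigma2), kl_via_bregman(mu, sigma2))
```

Note the argument order: KL(q || p) corresponds to the Bregman divergence of A from eta_q to eta_p, with the gradient taken at eta_q.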
What does the Legendre conjugate A*(mu) equal for an exponential family?
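A*(mu) = sup_eta(eta^T mu - A(eta)) = eta*(mu)^T mu - A(eta*(mu)) = -H(p_mu), where eta*(mu) solves nabla A(eta) = mu and the base measure is h(x) = 1; for general h, A*(mu) = -H(p_mu) - E[log h(X)]. This follows by substituting log p(x) = eta^T T(x) - A(eta) into the definition H(p) = -integral p log p and using nabla A(eta) = mu. The connection between convex duality and Shannon entropy is a key insight of information geometry.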
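For the Bernoulli family (where h = 1) the identity can be checked by brute force. This is a sketch under stated assumptions: the supremum is approximated by a maximum over a fine grid of eta values, and the grid range and resolution are arbitrary illustrative choices.

```python
import numpy as np


def log_partition(eta):
    """Bernoulli A(eta) = log(1 + exp(eta)), computed stably."""
    return np.logaddexp(0.0, eta)


def conjugate_on_grid(mu, eta_grid):
    """Approximate A*(mu) = sup_eta (eta * mu - A(eta)) by a grid maximum."""
    return np.max(eta_grid * mu - log_partition(eta_grid))


def negative_entropy(mu):
    """-H(Ber(mu)) = mu log mu + (1 - mu) log(1 - mu), for 0 < mu < 1."""
    return mu * np.log(mu) + (1.0 - mu) * np.log(1.0 - mu)


if __name__ == "__main__":
    eta_grid = np.linspace(-20.0, 20.0, 400001)
    for mu in (0.1, 0.3, 0.5, 0.9):
        print(mu, conjugate_on_grid(mu, eta_grid), negative_entropy(mu))
```

The maximizing eta on the grid is close to the log-odds log(mu / (1 - mu)), which is the inverse map eta = nabla A*(mu) for this family.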
Maximum Entropy Principle and Exponential Family Structure
The maximum entropy principle (Jaynes, 1957) states that given constraints on expected values of features T(x), the distribution with largest entropy satisfying those constraints is the unique exponential family distribution with sufficient statistic T(x). This provides a foundational justification for exponential families: they are the 'most unbiased' distributions consistent with the observed statistics.
Conjugate priors for exponential families have the form p(eta | chi, nu) = h(eta) exp(chi^T eta - nu A(eta) - B(chi, nu)). After observing n data points, the posterior updates by simply incrementing chi by sum T(x_i) and nu by n. This is Bayesian conjugacy as parameter addition in natural coordinates.
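As an illustration of the update rule, the sketch below treats the Beta-Bernoulli pair in (chi, nu) coordinates. It assumes the prior is flat in eta apart from the exponential factor, under which Beta(a, b) on the success probability corresponds to chi = a and nu = a + b; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class ConjugateState:
    """Hyperparameters (chi, nu) of a conjugate prior in natural coordinates."""

    chi: float
    nu: float

    def update(self, sufficient_stats: Iterable[float]) -> "ConjugateState":
        """Posterior after observing data: chi += sum T(x_i), nu += n."""
        stats = list(sufficient_stats)
        return ConjugateState(self.chi + sum(stats), self.nu + len(stats))

    def as_beta(self) -> Tuple[float, float]:
        """Bernoulli case only: (chi, nu) corresponds to Beta(chi, nu - chi)."""
        return self.chi, self.nu - self.chi


def from_beta(a: float, b: float) -> ConjugateState:
    """Bernoulli case only: Beta(a, b) on p maps to chi = a, nu = a + b."""
    return ConjugateState(chi=a, nu=a + b)


if __name__ == "__main__":
    prior = from_beta(2.0, 2.0)
    posterior = prior.update([1, 0, 1, 1])   # T(x) = x for Bernoulli data
    print(posterior, posterior.as_beta())
```

Because the update is plain addition, processing the data in one batch or one observation at a time gives the same posterior state.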
According to the maximum entropy principle, which distribution maximizes entropy subject to E[T(x)] = mu?
The constrained MaxEnt optimization has the exponential family member p*(x) = h(x) exp(eta^T T(x) - A(eta)) as its unique solution, with the Lagrange multipliers of the moment constraints becoming the natural parameters (h = 1 for plain entropy; a nonconstant h corresponds to maximizing entropy relative to the base measure h). Exponential families are therefore maximum-entropy distributions for given moment constraints, not ad hoc constructions.
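A finite-support example makes the Lagrange-multiplier picture concrete. The sketch below finds the maximum-entropy distribution on a six-sided die with a prescribed mean by solving nabla A(eta) = mu with Newton's method, using the variance as the second derivative of A. The support, target mean, and function names are illustrative assumptions, and h = 1 is assumed.

```python
import numpy as np


def gibbs(eta, x):
    """p(x) proportional to exp(eta * x) on a finite support (h = 1)."""
    logits = eta * x
    weights = np.exp(logits - logits.max())   # shift for numerical stability
    return weights / weights.sum()


def maxent_with_mean(support, target_mean, iters=50):
    """Solve grad A(eta) = target_mean by Newton's method on eta."""
    x = np.asarray(support, dtype=float)
    eta = 0.0
    for _ in range(iters):
        p = gibbs(eta, x)
        mean = p @ x                      # grad A(eta)
        var = p @ (x - mean) ** 2         # Hess A(eta)
        eta += (target_mean - mean) / var
    return eta, gibbs(eta, x)


def entropy(p):
    """Shannon entropy in nats, ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return -np.sum(nz * np.log(nz))


if __name__ == "__main__":
    faces = np.arange(1, 7)
    eta, p = maxent_with_mean(faces, target_mean=4.5)
    print(eta, p, entropy(p))
```

Any other distribution on the same support with the same mean, for example equal mass on faces 4 and 5, has strictly lower entropy than the returned p.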
Connections to other topics
Exponential families unite statistics, convex analysis, and information geometry through the structure of A(eta).
- Convex optimization
- Variational inference
- Maximum entropy principle
Summary
- Form: p(x|eta) = h(x) exp(eta^T T(x) - A(eta)), where eta is the natural parameter and T(x) is the sufficient statistic
- nabla A(eta) = E[T(X)] = mu: gradient of the log-partition function gives mean parameters
- Hess(A(eta)) = Cov(T(X)) = F(eta): Hessian is both the Fisher matrix and the statistic covariance
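- Legendre duality: eta and mu linked by nabla A and nabla A*; A*(mu) = negative entropy (for base measure h = 1)
- KL between two members of an exponential family is the Bregman divergence generated by A(eta): KL(p_eta1 || p_eta2) = A(eta2) - A(eta1) - nabla A(eta1)^T (eta2 - eta1)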