Statistics
Variational Inference (Advanced ELBO)
How can an intractable posterior distribution be approximated in scalable models when MCMC requires weeks of computation?
- **VAE (Kingma & Welling 2013):** training on 60,000 MNIST images in 4 minutes versus hours for MCMC; foundation of modern generative models
- **Latent Dirichlet Allocation:** variational EM processes 1 million documents in hours; MCMC on the same corpus takes weeks
- **Bayesian neural networks:** variational inference for prediction uncertainty in Tesla autonomous driving systems
- **Stan and PyMC:** automatic differentiation variational inference (ADVI) as a fast alternative to MCMC for exploratory analysis
Prerequisites
- KL divergence
- Bayesian inference
- Hierarchical Bayesian models
Variational inference converts integration into optimization: instead of MCMC sampling from p(z|x), the nearest distribution q_phi(z|x) from a parametric family Q is found by minimizing KL(q‖p). The problem becomes differentiable and is solved by stochastic gradient descent.
Posterior collapse in VAE: when the decoder p_theta(x|z) is powerful (e.g., a large autoregressive network), the network ignores z and q_phi(z|x) degenerates to p(z) (KL → 0). This is problematic for text generation and structured data. Solutions: KL annealing (gradually increasing the KL weight during training), free bits (enforcing minimum KL per latent dimension), or using convolutional decoders that limit receptive field.
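The two loss-side remedies can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names, the linear warm-up shape, and the default values (`warmup_steps`, `min_nats`) are assumptions chosen for the example.

```python
from typing import Sequence


def kl_weight(step: int, warmup_steps: int = 10_000) -> float:
    """KL annealing: the KL weight rises linearly from 0 to 1 over warmup_steps."""
    return min(1.0, step / warmup_steps)


def free_bits_kl(kl_per_dim: Sequence[float], min_nats: float = 0.5) -> float:
    """Free bits: each latent dimension contributes at least min_nats to the
    penalty, so the optimizer gains nothing by pushing its KL below that floor."""
    return sum(max(kl, min_nats) for kl in kl_per_dim)


def annealed_loss(recon_nll: float, kl_total: float, step: int) -> float:
    """Negative ELBO with an annealed KL term (hypothetical helper)."""
    return recon_nll + kl_weight(step) * kl_total
```

Early in training the KL term is nearly switched off, which gives the decoder time to start using z before the penalty pulls q_phi(z|x) toward the prior.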
Importance Weighted Autoencoder (IWAE, Burda et al., 2015) tightens the ELBO bound: L_K = E[log (1/K) ∑_{k=1}^K p(x,z_k)/q(z_k|x)], with z_1,...,z_K drawn i.i.d. from q(z|x), satisfies L_K >= L_{K-1} >= L_1 = ELBO and L_K → log p(x) as K → infinity. The bound tightens as K grows, but each step costs K times more and the signal-to-noise ratio of the encoder gradient degrades (Rainforth et al., 2018). Practical choice: K = 5-50 for final evaluation, K = 1 for fast training then fine-tune with larger K.
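The monotone tightening can be checked on a toy model where log p(x) is known in closed form. The sketch below assumes p(z) = N(0, 1), p(x|z) = N(z, 1), so p(x) = N(0, 2), and a deliberately too-wide proposal q(z|x) = N(x/2, 1). The model, the proposal, and all function names are assumptions made for this example.

```python
import math
import random


def log_normal(v: float, mean: float, var: float) -> float:
    """Log density of N(mean, var) at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)


def iwae_bound(x: float, K: int, n_outer: int = 4000, seed: int = 0) -> float:
    """Monte Carlo estimate of L_K for the toy model described above."""
    rng = random.Random(seed)
    q_mean, q_var = x / 2, 1.0  # the true posterior is N(x/2, 1/2)
    total = 0.0
    for _ in range(n_outer):
        log_w = []
        for _ in range(K):
            z = rng.gauss(q_mean, math.sqrt(q_var))
            log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
            log_w.append(log_joint - log_normal(z, q_mean, q_var))
        # log of the average importance weight, computed stably (log-sum-exp)
        m = max(log_w)
        total += m + math.log(sum(math.exp(lw - m) for lw in log_w) / K)
    return total / n_outer


x = 1.0
log_px = log_normal(x, 0.0, 2.0)  # exact log marginal likelihood
bounds = {K: iwae_bound(x, K) for K in (1, 5, 50)}
```

With K = 1 the estimate is the ordinary ELBO. Larger K moves the estimate toward `log_px`, up to Monte Carlo noise.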
Stochastic variational inference (SVI, Hoffman et al., 2013) scales variational inference to large datasets via minibatches: at each step a subsample of data is used to form a noisy estimate of the ELBO gradient. The updates follow stochastic (natural) gradient ascent with decreasing step size rho_t = (tau_0 + t)^{-kappa}, kappa in (0.5, 1], which satisfies the Robbins-Monro conditions (the step sizes sum to infinity while their squares have a finite sum). In Latent Dirichlet Allocation this enables training on corpora with millions of documents without loading the full dataset into memory.
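A minimal sketch of the step-size schedule and the resulting update of a global variational parameter. The function names and default values are assumptions for illustration. `lam_hat` stands for the estimate of the optimal parameter computed from the current minibatch as if it were replicated to the full dataset size.

```python
def svi_step_size(t: int, tau0: float = 1.0, kappa: float = 0.7) -> float:
    """Step size rho_t = (tau0 + t)^(-kappa), with kappa in (0.5, 1]."""
    if not 0.5 < kappa <= 1.0:
        raise ValueError("kappa must lie in (0.5, 1]")
    return (tau0 + t) ** (-kappa)


def svi_update(lam: float, lam_hat: float, rho: float) -> float:
    """Blend the current global parameter with the minibatch estimate."""
    return (1.0 - rho) * lam + rho * lam_hat


# Usage: early steps move quickly, later steps average out minibatch noise.
lam = 0.0
for t, lam_hat in enumerate([10.0, 12.0, 9.0, 11.0]):  # made-up estimates
    lam = svi_update(lam, lam_hat, svi_step_size(t))
```

A larger `tau0` down-weights the earliest iterations, and `kappa` controls how fast old information is forgotten.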
Choosing the variational family is the key tradeoff: mean-field Q = prod_i q(z_i) is fast but misses correlations. Full-rank Gaussian q is expensive at O(d²) parameters. Normalizing flows offer intermediate expressiveness with manageable computational cost.
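To make the cost difference concrete, the parameter counts for a d-dimensional Gaussian family can be computed directly. This small helper is hypothetical. It assumes the full-rank family is parametrised by a mean vector plus a lower-triangular Cholesky factor.

```python
def gaussian_family_param_counts(d: int) -> dict:
    """Number of free parameters for two Gaussian variational families."""
    return {
        "mean_field": 2 * d,                 # d means + d standard deviations
        "full_rank": d + d * (d + 1) // 2,   # d means + Cholesky factor entries
    }


counts = gaussian_family_param_counts(1000)
```

For d = 1000 the mean-field family has 2,000 parameters and the full-rank family 501,500, which is why full-rank Gaussians are rarely used for large latent spaces.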
Hierarchical VAE (HVAE, Sønderby et al., 2016) uses multiple latent variable layers: z_1, z_2,...,z_L with top-down inference q(z_1,...,z_L|x) approximated by a ladder network. The ELBO becomes E_q[log p(x|z_1,...,z_L)] - sum_l E_q[KL(q(z_l|z_{>l}, x) || p(z_l|z_{>l}))]: one reconstruction term minus one KL term per layer. HVAE can capture hierarchical abstractions: lower layers encode fine details, upper layers encode semantics. The Variational Diffusion Model (Kingma et al., 2021) is an HVAE with T → infinity levels.
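For reference, the full bound for the top-down factorization can be written out as follows. The notation is an assumption of this sketch: z_{1:L} denotes all layers, and z_{>L} is empty, so the top-layer term reduces to KL(q(z_L|x) || p(z_L)).

```latex
\mathcal{L}(x)
  = \mathbb{E}_{q(z_{1:L} \mid x)}\bigl[\log p(x \mid z_{1:L})\bigr]
  - \sum_{l=1}^{L} \mathbb{E}_{q(z_{>l} \mid x)}
    \Bigl[ \mathrm{KL}\bigl( q(z_l \mid z_{>l}, x) \,\|\, p(z_l \mid z_{>l}) \bigr) \Bigr]
```

The first term is the reconstruction term. Each KL term is averaged over the layers above it, because both the approximate posterior and the prior at layer l are conditioned on z_{>l}.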
Amortized inference in VAE: the encoder q_phi(z|x) is a neural network that maps each data point x to a distribution over z without per-datapoint optimization. At test time, inference is a single forward pass through the encoder. This contrasts with classical VI, which runs a separate optimization for each data point. Amortization gap: the encoder's output for a given x is generally worse than the best member of the variational family for that x, because a single network with limited capacity must serve every data point. This comes on top of the approximation gap between the family's best member and the true posterior p(z|x).
ELBO and the variational lower bound
Variational inference (VI) replaces MCMC with optimization: the posterior p(z | x) is approximated by a parametric distribution q_φ(z), and φ is chosen to minimise KL(q_φ || p). Direct KL minimisation is intractable (it requires p(x)), but it is equivalent to maximising the ELBO, a lower bound on the log marginal likelihood log p(x).
Why is maximising ELBO equivalent to minimising KL(q || p(z|x)) for fixed data x?
Key identity: log p(x) = ELBO(q_φ) + KL(q_φ || p(z|x)). The left side is fixed for the given x (it is the model's marginal likelihood). So ELBO + KL is constant and maximising ELBO over φ automatically minimises KL. This bypasses the intractable p(x) and lets us work with ELBO only.
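The identity can be verified numerically on a model small enough to enumerate. The sketch below assumes a latent variable z with three states and a single observed x. The probability tables are made up for the example.

```python
import math

prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p(x | z) for the one observed x
joint = [p * l for p, l in zip(prior, likelihood)]  # p(x, z)
evidence = sum(joint)                               # p(x)
posterior = [j / evidence for j in joint]           # p(z | x)


def elbo(q):
    """E_q[log p(x, z) - log q(z)] for a discrete q."""
    return sum(qi * (math.log(ji) - math.log(qi)) for qi, ji in zip(q, joint) if qi > 0)


def kl(q, p):
    """KL(q || p) for discrete distributions on the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


q = [0.2, 0.5, 0.3]  # an arbitrary approximation
gap = math.log(evidence) - (elbo(q) + kl(q, posterior))  # zero up to rounding
```

Changing `q` moves `elbo(q)` and `kl(q, posterior)` in opposite directions by the same amount. Setting `q = posterior` drives the KL to zero and makes the ELBO equal to log p(x).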
Mean-field approximation
Mean-field VI is the most popular parametrisation: q factorises over coordinates z_j, q(z) = Π_j q_j(z_j). Each q_j is optimised in turn with the others fixed (coordinate ascent VI, CAVI). Closed-form updates exist for conjugate exponential families.
Mean-field updates for Gaussian models resemble EM but are symmetric over all variables. SVI (Stochastic VI) applies mini-batch SGD to ELBO for large datasets.
What is the main limitation of the mean-field variational approximation q(z) = Π_j q_j(z_j)?
By construction q(z) = Π_j q_j(z_j) cannot express any correlation between coordinates. If the true p(z | x) has strongly correlated components (e.g., latent structure in a hierarchical model), mean-field fits each coordinate separately and underestimates the posterior variance, so the approximation is overconfident. The fix is structured VI or normalizing flows that allow dependencies.
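The variance underestimate shows up clearly in the textbook case of a correlated bivariate Gaussian target with unit marginal variances and correlation rho. In that case the CAVI updates are available in closed form: each factor is Gaussian with variance 1/Λ_jj, where Λ is the precision matrix of the target. The function name and the example values below are assumptions for illustration.

```python
def cavi_bivariate_gaussian(mu, rho, n_iter=200):
    """Mean-field CAVI for the target N(mu, [[1, rho], [rho, 1]])."""
    # Entries of the precision matrix (inverse covariance) of the target.
    lam11 = lam22 = 1.0 / (1.0 - rho ** 2)
    lam12 = -rho / (1.0 - rho ** 2)
    m1, m2 = 0.0, 0.0
    for _ in range(n_iter):
        # Update each factor's mean with the other factor held fixed.
        m1 = mu[0] - (lam12 / lam11) * (m2 - mu[1])
        m2 = mu[1] - (lam12 / lam22) * (m1 - mu[0])
    # The factor variances do not depend on the means: they equal 1 / Lambda_jj.
    return (m1, m2), (1.0 / lam11, 1.0 / lam22)


means, variances = cavi_bivariate_gaussian(mu=(1.0, -2.0), rho=0.9)
```

The means converge to the true means, but each factor's variance is 1 - rho², which is 0.19 for rho = 0.9, while the true marginal variance is 1.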
Reparameterization and the VAE connection
The reparameterization trick (Kingma-Welling, 2013): for smooth q_φ, express z = g(ε, φ) with a fixed base ε ~ N(0, I). The expectation is then taken over ε, whose distribution does not depend on φ, so the ELBO gradient w.r.t. φ moves inside the expectation and is computed by ordinary backprop. This is the basis of VAEs and amortized VI.
Normalizing flows (Rezende-Mohamed) generalize reparameterization: z = f_K ∘ ... ∘ f_1(ε) where f_k are invertible transformations with a tractable Jacobian. They give flexible non-Gaussian q without losing differentiability.
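A one-dimensional flow is enough to show the change-of-variables bookkeeping. The sketch below composes an affine map with an exponential, so the result should match the log-normal density. The specific transformations and parameter values are assumptions chosen because the answer can be checked in closed form.

```python
import math


def log_std_normal(eps: float) -> float:
    """Log density of the base distribution N(0, 1)."""
    return -0.5 * (math.log(2 * math.pi) + eps * eps)


def flow_sample_and_log_q(eps: float, mu: float = 0.3, sigma: float = 0.8):
    """Push eps through f_2(f_1(eps)) and track log q(z) along the way."""
    u = mu + sigma * eps  # f_1: affine map, log|df_1/d eps| = log sigma
    z = math.exp(u)       # f_2: exp map,    log|df_2/du|    = u
    log_q = log_std_normal(eps) - math.log(sigma) - u
    return z, log_q


def log_lognormal(z: float, mu: float, sigma: float) -> float:
    """Closed-form log-normal log density, used only as a check."""
    return (-math.log(z * sigma * math.sqrt(2 * math.pi))
            - (math.log(z) - mu) ** 2 / (2 * sigma ** 2))


z, log_q = flow_sample_and_log_q(0.7)
check = log_lognormal(z, 0.3, 0.8)  # agrees with log_q up to rounding
```

Each layer subtracts its log-Jacobian from the running log density. In higher dimensions the same pattern holds with log|det J_k| in place of the scalar derivative.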
Why does the reparameterization trick give ELBO gradients with much lower variance than score-function (REINFORCE) estimates?
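Score function: ∇_φ E_q[f] = E_q[f · ∇_φ log q]. The estimator uses only the value of f, scaled by the score ∇_φ log q, so its variance is large and grows with the magnitude of f. Reparameterization z = g(ε, φ): ∇_φ E_ε[f(g(ε,φ))] = E_ε[∇_φ f(g(ε,φ))], which propagates the gradient of f directly through the sampling path. Pathwise variance is typically much lower in practice, often by one or two orders of magnitude, which makes VAE training feasible.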
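The two estimators can be compared on a toy problem with a known answer. The sketch below assumes f(z) = z² and q = N(mu, sigma²), for which the exact gradient of E_q[f] with respect to mu is 2·mu. The test function and the parameter values are assumptions for illustration.

```python
import random
import statistics


def gradient_samples(mu: float = 1.0, sigma: float = 1.0, n: int = 20000, seed: int = 0):
    """Per-sample estimates of d/dmu E_q[z^2] from both estimators."""
    rng = random.Random(seed)
    score, pathwise = [], []
    for _ in range(n):
        eps = rng.gauss(0.0, 1.0)
        z = mu + sigma * eps                         # reparameterized sample
        score.append(z * z * (z - mu) / sigma ** 2)  # f(z) * d/dmu log q(z)
        pathwise.append(2.0 * z)                     # f'(z) * dz/dmu
    return score, pathwise


score, pathwise = gradient_samples()
score_var = statistics.variance(score)        # analytically 30 for mu = sigma = 1
pathwise_var = statistics.variance(pathwise)  # analytically 4 for mu = sigma = 1
```

Both sample means approach the exact gradient 2, but the score-function estimates are far noisier even in this one-dimensional case. The ratio depends on f and on the dimension of z.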
Variational inference and generative models
Variational inference is the foundation of modern deep generative models.
- VAE and diffusion models — VAEs directly maximize the ELBO; diffusion models are hierarchical VAEs with a fixed forward process
- GANs — Likelihood-free alternative to generation; trades the probabilistic interpretation for sharper samples
- Bayesian neural networks — VI is applied to network weights to approximate the posterior and estimate epistemic uncertainty
Summary
- ELBO = E_q[log p(x,z)] - E_q[log q(z|x)] = log p(x) - KL(q||p(z|x)); maximized instead of direct integration
- VAE: reparameterization z = mu_phi + sigma_phi * epsilon enables backprop; ELBO = reconstruction - KL
- Normalizing flows: bijections f_k transform q_0 to q_K with exact density via change-of-variables
- BBVI: score function estimator ∇_phi L = E_q[(log p(x,z) - log q_phi(z)) ∇_phi log q_phi(z)]; works without reparameterization
- ADVI: Gaussian q in transformed space; fast exploratory approximation in Stan/PyMC