Stochastic Processes
MCMC and Sampling
A Bayesian neural network with a million parameters. A hierarchical model for a clinical trial. Parameter estimation from gravitational-wave data. All of it boils down to sampling from complex high-dimensional distributions. MCMC is the workhorse of modern Bayesian statistics and a standard tool in any data scientist's kit.
- **Bayesian models (Stan, PyMC):** hierarchical models and mixed effects, sampled with NUTS (HMC)
- **Particle physics and astrophysics:** parameter estimation from huge datasets via MCMC, with rigorous uncertainty quantification
- **Generative models (Langevin diffusion):** sampling from a neural-network posterior reduces to a Langevin equation
Предварительные знания
The Metropolis-Hastings Algorithm
How does one sample from a complicated distribution π(x) known only up to a normalizing constant? The Metropolis-Hastings algorithm builds a Markov chain whose stationary distribution is π. The idea is simple: propose random steps, and accept or reject each one with a probability set by the ratio of densities.
**Metropolis-Hastings algorithm:** 1. From the current x, propose y ~ q(y|x) 2. Accept y with probability α = min(1, π(y)q(x|y) / (π(x)q(y|x))) 3. Otherwise stay at x **Guarantee:** The chain has stationary distribution π (detailed balance). For symmetric q(y|x) = q(x|y): α = min(1, π(y)/π(x))-the Metropolis algorithm.
**Detailed balance** guarantees correctness: π(x)·q(y|x)·α(x,y) = π(y)·q(x|y)·α(y,x). It is a sufficient (not necessary) condition for π to be the stationary distribution. **Step size** is the critical knob: too small means slow mixing; too large means low acceptance. The optimal acceptance rate for a random walk in d dimensions is about 23%.
| Property | Good sampler | Bad sampler |
|---|---|---|
| Acceptance rate | 23-50% | <5% or >95% |
| Autocorrelation | Decays quickly | Decays slowly |
| Effective Sample Size (ESS) | Close to N | <<N |
| Trace plot | Jumps rapidly | Sticks in one region |
| R-hat (Gelman-Rubin) | <1.01 (multiple chains) | >1.1 = not converged |
After burnin, an MCMC chain produces independent samples from π
MCMC samples are correlated! Consecutive steps depend on previous ones. The effective sample size (ESS) << N due to autocorrelation
Independent samples give ESS = N. For MCMC, ESS = N / (1 + 2*sum of autocorrelations). With slow mixing, ESS << N. This matters for accuracy: confidence intervals should be computed using ESS, not N
In the Metropolis algorithm the acceptance rate is 2%. What is the right fix?
The Gibbs Sampler
When sampling from a multivariate distribution π(x₁,...,x_d), the conditional distributions π(x_i | x_{-i}) often have a closed form even though the joint does not. The **Gibbs sampler** uses that: it updates each coordinate in turn by sampling from its conditional. It is a special case of Metropolis-Hastings with acceptance rate 1 at every step.
**Gibbs sampler:** For distribution π(x₁,...,x_d): 1. For k = 1,...,d: Sample x_k^{new} ~ π(x_k | x_1^{new},...,x_{k-1}^{new}, x_{k+1},...,x_d) 2. Repeat **Guarantee:** Detailed balance holds for each conditional update → π is the stationary distribution. **Requirement:** Must be able to efficiently sample from all conditional distributions.
**Blocked Gibbs:** instead of updating one coordinate at a time, update blocks of variables together. This cuts autocorrelation when variables inside a block are strongly correlated. In Bayesian hierarchical models (Stan, PyMC), Gibbs is the workhorse for conjugate components.
| Method | Key idea | When to use |
|---|---|---|
| Metropolis-Hastings | Accept/reject random proposals | General case |
| Gibbs | Sample from conditionals | When conditionals are known analytically |
| Blocked Gibbs | Update blocks of variables | Strongly correlated parameters |
| Slice sampling | Sample a "horizontal slice" | One-dimensional, no step size needed |
The Gibbs sampler is always faster than Metropolis because its acceptance rate is 100%
Acceptance rate = 100% is a local property of each step, not a measure of mixing speed. Gibbs can be slower than Metropolis when variables are highly correlated
MCMC speed is determined by mixing time, not acceptance rate. With high correlation, Gibbs takes tiny diagonal steps along the ellipse, while Metropolis with the right step size can jump across the entire ellipse
Why does the Gibbs sampler mix slowly when variables are highly correlated?
Hamiltonian Monte Carlo
Standard Metropolis wanders at random without using any structure of the target distribution. **Hamiltonian Monte Carlo (HMC)** uses the gradient of log π for directed motion: instead of a random walk, it simulates deterministic Hamiltonian dynamics that follow the contours of the distribution. This is the foundation of modern Bayesian frameworks (Stan, PyMC, NumPyro).
**Hamiltonian Monte Carlo:** 1. Add momentum p ~ N(0, M) (auxiliary variable) 2. Simulate Hamiltonian dynamics (leapfrog): H(x,p) = -log π(x) + p²/2M x_{t+ε} = x_t + ε·p_t/M p_{t+ε} = p_t + ε·∇log π(x_{t+ε}) 3. Accept/reject by the Metropolis criterion (corrects for discretization error) **Advantage:** Proposals are far from the current point but are still accepted with high probability.
**No-U-Turn Sampler (NUTS)**-the adaptive version of HMC used in Stan-automatically chooses L (the number of leapfrog steps) by stopping when the path "doubles back" (U-turn). This removes the need to manually tune L, which is HMC's main hyperparameter.
HMC in Bayesian statistics
HMC was proposed by Duane and colleagues in 1987 for lattice quantum chromodynamics. Neal (1996) adapted it for Bayesian statistics. NUTS (Hoffman and Gelman, 2011) made HMC practical without manual tuning. Stan (2012) implemented NUTS with automatic differentiation, which transformed applied Bayesian work. Today HMC and NUTS are the standard for hierarchical Bayesian models in both science and industry.
HMC works for any distribution without additional requirements
HMC requires: 1. continuous parameters 2. differentiable log π 3. tuning of ε and L (or NUTS). Discrete parameters need other methods
Hamiltonian dynamics use the gradient-differentiability is required. Discrete variables have no gradient. Hybrid approaches: Gibbs for discrete + HMC for continuous-are used in modern PPLs
Why is HMC more efficient than random-walk Metropolis in high-dimensional problems?
Convergence Diagnostics
How can one tell an MCMC chain has converged to π? Certainty is impossible-evidence only accumulates. Several diagnostics exist: visual (trace plots, autocorrelation plots) and quantitative (Gelman-Rubin R-hat, ESS, BFMI for HMC).
**Key MCMC diagnostics:** **R-hat (Gelman-Rubin):** Run multiple chains from different starting points. R̂ = sqrt(Var_between / Var_within + (N-1)/N) R̂ < 1.01-good; R̂ > 1.1-not converged **ESS (Effective Sample Size):** ESS = N / (1 + 2·∑_{k=1}^∞ ρ_k) where ρ_k is the autocorrelation at lag k. **BFMI (Bayesian Fraction Missing Information):** specific to HMC-checks the chain's energy.
**Practical guidelines:** - Always run multiple chains (at least 4) from different starting points - Discard burnin (typically 50% of iterations or until the trace looks stable) - Check R̂ < 1.01 and ESS > 400 for reliable estimates - For HMC: check acceptance rate 60-90% and BFMI > 0.2
| Diagnostic | Good | Problem | Fix |
|---|---|---|---|
| R-hat | < 1.01 | > 1.1 | More iterations, diverse starting points |
| ESS | > 400 | < 100 | More iterations, better sampler |
| Acceptance rate (MH) | 23-50% | < 5% or > 95% | Tune step size |
| Trace plot | "Caterpillar" | Trend or sticking | More burnin or different sampler |
| Autocorrelation | Decays quickly | High at large lags | Thinning or better sampler |
If the MCMC trace plot has stabilized (appears horizontal), the chain has converged
A stable trace plot is necessary but not sufficient for convergence. A chain can get stuck in one mode of a multimodal distribution, look stable, and never explore the other modes
For multimodal distributions, a chain can linger in one mode for a long time. R-hat detects this when multiple chains are run from different starting points: if some chains land in a different mode, R-hat will be > 1
R-hat = 1.05 after 10000 iterations for each of 4 chains. What is the right next step?
Key Ideas
- **Metropolis-Hastings:** propose a step, accept with prob = min(1, π(y)/π(x)·q(x|y)/q(y|x)); driven by detailed balance
- **Gibbs:** sample each coordinate from its conditional in turn; acceptance rate is 1; mixes slowly when correlations are high
- **HMC:** uses the gradient of log π and leapfrog integration; efficient in high dimensions; NUTS is the adaptive version
- **Diagnostics:** R̂ < 1.01, ESS > 400, trace plots; always run multiple chains from different starting points
Related Topics
MCMC connects Markov chain theory with computational Bayesian statistics:
- Markov Chains — MCMC builds a discrete-time Markov chain with stationary distribution π; DTMC convergence theory applies
- Martingale Inequalities — ESS and MCMC convergence rates are derived via martingale arguments
- Stochastic in ML — SGD and Langevin dynamics are MCMC for optimization; score matching is related to HMC
Вопросы для размышления
- MCMC and SGD both use randomness to move through parameter space. What is the core difference in what they are trying to achieve?
- Diffusion generative models use Langevin MCMC for sampling. Why are they faster than standard MCMC?
- If R-hat = 2.0, what does that say about the geometry of the target distribution, and how should the problem be reformulated?