Stochastic Processes

MCMC and Sampling

A Bayesian neural network with a million parameters. A hierarchical model for a clinical trial. Parameter estimation from gravitational-wave data. All of it boils down to sampling from complex high-dimensional distributions. MCMC is the workhorse of modern Bayesian statistics and a standard tool in any data scientist's kit.

  • **Bayesian models (Stan, PyMC):** hierarchical models and mixed effects, sampled with NUTS (HMC)
  • **Particle physics and astrophysics:** parameter estimation from huge datasets via MCMC, with rigorous uncertainty quantification
  • **Generative models (Langevin diffusion):** sampling from a neural-network posterior reduces to a Langevin equation

Предварительные знания

  • Discrete-Time Markov Chains
  • Martingale Inequalities and Convergence

The Metropolis-Hastings Algorithm

How does one sample from a complicated distribution π(x) known only up to a normalizing constant? The Metropolis-Hastings algorithm builds a Markov chain whose stationary distribution is π. The idea is simple: propose random steps, and accept or reject each one with a probability set by the ratio of densities.

**Metropolis-Hastings algorithm:** 1. From the current x, propose y ~ q(y|x) 2. Accept y with probability α = min(1, π(y)q(x|y) / (π(x)q(y|x))) 3. Otherwise stay at x **Guarantee:** The chain has stationary distribution π (detailed balance). For symmetric q(y|x) = q(x|y): α = min(1, π(y)/π(x))-the Metropolis algorithm.

**Detailed balance** guarantees correctness: π(x)·q(y|x)·α(x,y) = π(y)·q(x|y)·α(y,x). It is a sufficient (not necessary) condition for π to be the stationary distribution. **Step size** is the critical knob: too small means slow mixing; too large means low acceptance. The optimal acceptance rate for a random walk in d dimensions is about 23%.

PropertyGood samplerBad sampler
Acceptance rate23-50%<5% or >95%
AutocorrelationDecays quicklyDecays slowly
Effective Sample Size (ESS)Close to N<<N
Trace plotJumps rapidlySticks in one region
R-hat (Gelman-Rubin)<1.01 (multiple chains)>1.1 = not converged

After burnin, an MCMC chain produces independent samples from π

MCMC samples are correlated! Consecutive steps depend on previous ones. The effective sample size (ESS) << N due to autocorrelation

Independent samples give ESS = N. For MCMC, ESS = N / (1 + 2*sum of autocorrelations). With slow mixing, ESS << N. This matters for accuracy: confidence intervals should be computed using ESS, not N

In the Metropolis algorithm the acceptance rate is 2%. What is the right fix?

The Gibbs Sampler

When sampling from a multivariate distribution π(x₁,...,x_d), the conditional distributions π(x_i | x_{-i}) often have a closed form even though the joint does not. The **Gibbs sampler** uses that: it updates each coordinate in turn by sampling from its conditional. It is a special case of Metropolis-Hastings with acceptance rate 1 at every step.

**Gibbs sampler:** For distribution π(x₁,...,x_d): 1. For k = 1,...,d: Sample x_k^{new} ~ π(x_k | x_1^{new},...,x_{k-1}^{new}, x_{k+1},...,x_d) 2. Repeat **Guarantee:** Detailed balance holds for each conditional update → π is the stationary distribution. **Requirement:** Must be able to efficiently sample from all conditional distributions.

**Blocked Gibbs:** instead of updating one coordinate at a time, update blocks of variables together. This cuts autocorrelation when variables inside a block are strongly correlated. In Bayesian hierarchical models (Stan, PyMC), Gibbs is the workhorse for conjugate components.

MethodKey ideaWhen to use
Metropolis-HastingsAccept/reject random proposalsGeneral case
GibbsSample from conditionalsWhen conditionals are known analytically
Blocked GibbsUpdate blocks of variablesStrongly correlated parameters
Slice samplingSample a "horizontal slice"One-dimensional, no step size needed

The Gibbs sampler is always faster than Metropolis because its acceptance rate is 100%

Acceptance rate = 100% is a local property of each step, not a measure of mixing speed. Gibbs can be slower than Metropolis when variables are highly correlated

MCMC speed is determined by mixing time, not acceptance rate. With high correlation, Gibbs takes tiny diagonal steps along the ellipse, while Metropolis with the right step size can jump across the entire ellipse

Why does the Gibbs sampler mix slowly when variables are highly correlated?

Hamiltonian Monte Carlo

Standard Metropolis wanders at random without using any structure of the target distribution. **Hamiltonian Monte Carlo (HMC)** uses the gradient of log π for directed motion: instead of a random walk, it simulates deterministic Hamiltonian dynamics that follow the contours of the distribution. This is the foundation of modern Bayesian frameworks (Stan, PyMC, NumPyro).

**Hamiltonian Monte Carlo:** 1. Add momentum p ~ N(0, M) (auxiliary variable) 2. Simulate Hamiltonian dynamics (leapfrog): H(x,p) = -log π(x) + p²/2M x_{t+ε} = x_t + ε·p_t/M p_{t+ε} = p_t + ε·∇log π(x_{t+ε}) 3. Accept/reject by the Metropolis criterion (corrects for discretization error) **Advantage:** Proposals are far from the current point but are still accepted with high probability.

**No-U-Turn Sampler (NUTS)**-the adaptive version of HMC used in Stan-automatically chooses L (the number of leapfrog steps) by stopping when the path "doubles back" (U-turn). This removes the need to manually tune L, which is HMC's main hyperparameter.

HMC in Bayesian statistics

HMC was proposed by Duane and colleagues in 1987 for lattice quantum chromodynamics. Neal (1996) adapted it for Bayesian statistics. NUTS (Hoffman and Gelman, 2011) made HMC practical without manual tuning. Stan (2012) implemented NUTS with automatic differentiation, which transformed applied Bayesian work. Today HMC and NUTS are the standard for hierarchical Bayesian models in both science and industry.

HMC works for any distribution without additional requirements

HMC requires: 1. continuous parameters 2. differentiable log π 3. tuning of ε and L (or NUTS). Discrete parameters need other methods

Hamiltonian dynamics use the gradient-differentiability is required. Discrete variables have no gradient. Hybrid approaches: Gibbs for discrete + HMC for continuous-are used in modern PPLs

Why is HMC more efficient than random-walk Metropolis in high-dimensional problems?

Convergence Diagnostics

How can one tell an MCMC chain has converged to π? Certainty is impossible-evidence only accumulates. Several diagnostics exist: visual (trace plots, autocorrelation plots) and quantitative (Gelman-Rubin R-hat, ESS, BFMI for HMC).

**Key MCMC diagnostics:** **R-hat (Gelman-Rubin):** Run multiple chains from different starting points. R̂ = sqrt(Var_between / Var_within + (N-1)/N) R̂ < 1.01-good; R̂ > 1.1-not converged **ESS (Effective Sample Size):** ESS = N / (1 + 2·∑_{k=1}^∞ ρ_k) where ρ_k is the autocorrelation at lag k. **BFMI (Bayesian Fraction Missing Information):** specific to HMC-checks the chain's energy.

**Practical guidelines:** - Always run multiple chains (at least 4) from different starting points - Discard burnin (typically 50% of iterations or until the trace looks stable) - Check R̂ < 1.01 and ESS > 400 for reliable estimates - For HMC: check acceptance rate 60-90% and BFMI > 0.2

DiagnosticGoodProblemFix
R-hat< 1.01> 1.1More iterations, diverse starting points
ESS> 400< 100More iterations, better sampler
Acceptance rate (MH)23-50%< 5% or > 95%Tune step size
Trace plot"Caterpillar"Trend or stickingMore burnin or different sampler
AutocorrelationDecays quicklyHigh at large lagsThinning or better sampler

If the MCMC trace plot has stabilized (appears horizontal), the chain has converged

A stable trace plot is necessary but not sufficient for convergence. A chain can get stuck in one mode of a multimodal distribution, look stable, and never explore the other modes

For multimodal distributions, a chain can linger in one mode for a long time. R-hat detects this when multiple chains are run from different starting points: if some chains land in a different mode, R-hat will be > 1

R-hat = 1.05 after 10000 iterations for each of 4 chains. What is the right next step?

Key Ideas

  • **Metropolis-Hastings:** propose a step, accept with prob = min(1, π(y)/π(x)·q(x|y)/q(y|x)); driven by detailed balance
  • **Gibbs:** sample each coordinate from its conditional in turn; acceptance rate is 1; mixes slowly when correlations are high
  • **HMC:** uses the gradient of log π and leapfrog integration; efficient in high dimensions; NUTS is the adaptive version
  • **Diagnostics:** R̂ < 1.01, ESS > 400, trace plots; always run multiple chains from different starting points

Related Topics

MCMC connects Markov chain theory with computational Bayesian statistics:

  • Markov Chains — MCMC builds a discrete-time Markov chain with stationary distribution π; DTMC convergence theory applies
  • Martingale Inequalities — ESS and MCMC convergence rates are derived via martingale arguments
  • Stochastic in ML — SGD and Langevin dynamics are MCMC for optimization; score matching is related to HMC

Вопросы для размышления

  • MCMC and SGD both use randomness to move through parameter space. What is the core difference in what they are trying to achieve?
  • Diffusion generative models use Langevin MCMC for sampling. Why are they faster than standard MCMC?
  • If R-hat = 2.0, what does that say about the geometry of the target distribution, and how should the problem be reformulated?

Связанные уроки

  • prob-04-bayes
MCMC and Sampling

0

1

Sign In