Scientific Computing

Monte Carlo Methods

1945. Los Alamos. The Manhattan Project is trying to simulate neutron diffusion through uranium. The Boltzmann transport equation is an 7-dimensional integral equation with no closed form. Stanislaw Ulam proposes: simulate individual neutrons randomly. Thousands of random trajectories, average the result. John von Neumann writes the first Monte Carlo algorithm in a few hours. The method that saved nuclear physics now prices Goldman Sachs derivatives in milliseconds, estimates Bayesian posteriors for clinical trials, and drives the sampling in diffusion models (Stable Diffusion, DALL-E). Random sampling is a superpower.

  • **JPMorgan risk management**: Monte Carlo simulation of 100,000+ portfolio scenarios per second for Value-at-Risk and Expected Shortfall calculations; variance reduction (antithetic variates + control variates) reduces simulation time by 50-100x versus naive MC
  • **DeepMind AlphaFold**: protein structure prediction uses Markov Chain Monte Carlo (MCMC) for conformational sampling in early stages; combined with neural networks, AlphaFold solved protein folding for 200+ million proteins in 2021
  • **Diffusion models (DALL-E 3, Stable Diffusion)**: image generation reverses a diffusion process by sampling from learned conditional distributions using Langevin dynamics (a continuous-time MCMC variant); each image generation is ~50 MCMC denoising steps

Monte Carlo Integration

In 1945, Stanislaw Ulam was playing solitaire while recovering from surgery. He wondered: what is the probability of winning? Instead of computing it analytically, he thought - just play 100 games and count. This insight became the **Monte Carlo method**, named for the casino. Enrico Fermi and John von Neumann immediately saw its application: simulate nuclear reactions stochastically rather than solving Boltzmann's transport equation analytically. The method: estimate expected values and integrals by averaging over random samples.

**Monte Carlo integration:** `E[f(X)] = ∫ f(x) p(x) dx ≈ (1/N) Σ f(x_i)` where x_i ~ p(x) **Error**: standard deviation of the estimate ~ σ/√N. The error decreases as 1/√N regardless of the dimension d of the integration space. Classical numerical integration (Simpson's rule, Gaussian quadrature) requires O(N^(1/d)) points to achieve error ε in d dimensions - exponentially bad for high d. Monte Carlo achieves the same ε with O(1/ε^2) samples for any d. **Monte Carlo wins in high dimensions.**

**Monte Carlo error is O(1/√N) regardless of dimension** - this is the key advantage over grid-based methods. But it also means reducing error by 10x requires 100x more samples. For problems requiring 4 decimal places of accuracy, expect N = 10^8 samples. This is why variance reduction techniques (next concept) are critical for production Monte Carlo.

Why does Monte Carlo integration outperform grid-based numerical integration in high dimensions (d >> 1)?

Markov Chain Monte Carlo (MCMC)

Direct Monte Carlo requires sampling from the target distribution p(x). For a Bayesian posterior `p(θ|data) ∝ p(data|θ) · p(θ)`, the normalizing constant Z = ∫p(data|θ)p(θ)dθ may be intractable - so direct sampling is impossible. **Markov Chain Monte Carlo** bypasses this: construct a Markov chain whose stationary distribution is the target p(x), then run the chain long enough to collect samples. No normalizing constant required. MCMC powers Bayesian inference, statistical physics, and generative AI sampling.

**Metropolis-Hastings algorithm:** 1. Start at current state x_t 2. Propose x* from proposal q(x*|x_t) (e.g., Gaussian centered at x_t) 3. Compute acceptance ratio: `α = min(1, p(x*)·q(x_t|x*) / (p(x_t)·q(x*|x_t)))` 4. Accept x* with probability α (else stay at x_t) 5. x_{t+1} = x* if accepted, x_t otherwise 6. Repeat Note: p(x*)/p(x_t) = p*(x*)/p*(x_t) - the normalizing constant cancels! This is why MCMC does not need Z.

**MCMC diagnostics**: raw sample count is not effective sample size (ESS). If the Markov chain moves slowly (small proposal std → high autocorrelation), 10,000 samples may give ESS of 100. Tools: R-hat convergence diagnostic (< 1.01 for convergence), trace plots (should look like white noise after burn-in), autocorrelation function (should decay quickly). Stan and PyMC3 compute these automatically.

Why does the Metropolis-Hastings acceptance ratio α = p(x*)/p(x_t) not require the normalizing constant Z?

Importance Sampling

In option pricing, the event 'stock drops 50% in one day' has probability 10^{-10} under the log-normal model. Standard Monte Carlo would need 10^{12} simulations to see a single such event. Yet this scenario dominates the tail risk. **Importance sampling** changes the sampling distribution to oversample rare but important events, then reweights the samples to correct for the distributional change. Used in: financial risk (Value-at-Risk), reliability analysis, off-policy reinforcement learning.

**Importance sampling estimator:** `E_p[f(x)] = ∫ f(x) p(x) dx = ∫ f(x) [p(x)/q(x)] q(x) dx = E_q[f(x) w(x)]` where `w(x) = p(x)/q(x)` are importance weights and q(x) is the **proposal distribution** (easy to sample from, concentrated in important regions). **Self-normalized IS**: `E_p[f] ≈ Σ w_i f(x_i) / Σ w_i` - does not require knowing p's normalizing constant. Used in particle filters and variational inference.

Importance sampling uses a proposal distribution q instead of the target p. What condition on q is required for the estimator to be valid?

Variance Reduction Techniques

Monte Carlo error is O(σ/√N): to halve the error, quadruple the samples. Variance reduction methods reduce σ without increasing N - achieving the same accuracy with fewer simulations. In derivative pricing, reducing σ by 10x means 100x fewer paths. JPMorgan, Goldman Sachs, and Renaissance Technologies routinely use variance reduction to price complex derivatives in seconds instead of hours.

TechniqueMechanismTypical Variance Reduction
Antithetic variatesFor each sample u, also use 1-u (negative correlation)2-4x
Control variatesSubtract analytically tractable correlated variable10-100x
Stratified samplingDivide domain into strata, sample proportionally2-10x
Quasi-Monte Carlo (QMC)Low-discrepancy sequences (Sobol) instead of pseudorandom10-1000x in low-d
Importance samplingConcentrate samples in high-contribution regions10-10^6x for rare events

MCMC produces independent samples from the target distribution, so sample size directly equals effective sample size

MCMC samples are autocorrelated (sequential samples are similar due to the Markov property); effective sample size (ESS) is often 10-100x smaller than raw sample count; proper convergence diagnostics (R-hat, ESS, trace plots) are mandatory

The Markov chain generates correlated samples. A chain that moves slowly (small proposal steps, multimodal target with separated modes) produces highly autocorrelated samples - ESS of 50 from 5000 raw samples. The ESS formula accounts for autocorrelation: ESS = N / (1 + 2·Σ autocorrelation(lag)). Stan and PyMC automatically compute ESS and warn when it is low. Ignoring autocorrelation leads to overconfident posterior intervals.

Quasi-Monte Carlo methods using Sobol sequences achieve O(1/N) convergence for smooth integrands versus O(1/√N) for pseudorandom Monte Carlo. Why does this advantage diminish in high dimensions?

Key Ideas

  • **Monte Carlo integration**: estimate E_p[f] by averaging f(x_i) over samples x_i ~ p; error O(σ/√N) independent of dimension - key advantage over grid methods for d > 5
  • **MCMC (Metropolis-Hastings)**: construct Markov chain with target stationary distribution; acceptance ratio p*/p* cancels normalizing constant; requires burn-in and effective sample size analysis for reliability
  • **Importance sampling**: sample from q(x) instead of p(x) and reweight by w(x) = p(x)/q(x); enables estimation of rare-event probabilities with 10^3-10^6x fewer samples than naive MC
  • **Variance reduction** (control variates, antithetic variates, QMC): reduce σ without increasing N; control variates with analytical tractable correlated quantity achieve 10-100x reduction; QMC gives O(1/N) convergence in low dimensions

Related Topics

Monte Carlo bridges probability theory, computational statistics, and scientific simulation:

  • Numerical Optimization — Stochastic optimization (SGD, simulated annealing) uses Monte Carlo ideas; Bayesian optimization uses surrogate models similar to MCMC posterior approximation
  • Probability and Statistics — Monte Carlo relies on law of large numbers and central limit theorem; variance reduction exploits statistical correlation structure
  • Reinforcement Learning Basics — REINFORCE algorithm is Monte Carlo policy evaluation; off-policy methods use importance sampling to reweight trajectories collected under different policies

Вопросы для размышления

  • Monte Carlo error is O(1/√N): halving the error costs 4x the samples. At what dimension d does Monte Carlo outperform 5-point Gaussian quadrature (which has O(5^d) cost)? What does this crossover imply for choosing between deterministic and stochastic integration methods?
  • An MCMC chain for a 50-dimensional posterior has raw sample count N=100,000 but ESS=500. The Metropolis-Hastings acceptance rate is 95%. What does this diagnose - and what hyperparameter adjustment would improve ESS?
  • Importance sampling fails when the proposal q has lighter tails than the target p (importance weights have infinite variance). Design a worst-case scenario where naive importance sampling with a Gaussian proposal estimates a heavy-tailed integral - what goes wrong numerically and how does robust IS (e.g., PSIS) address it?

Связанные уроки

  • prob-01-intro
Monte Carlo Methods

0

1

Sign In