Numerical Methods
Monte Carlo Methods
How do one price a financial derivative depending on 100 assets? How do one test nuclear weapons without an actual explosion? How do one train a Bayesian neural network? All three involve high-dimensional integrals, intractable for classical methods. Monte Carlo was invented at Los Alamos in the 1940s to simulate nuclear reactions and has since become a foundation of machine learning and quantitative finance.
- **Finance**: Monte Carlo option pricing is standard in banks for path-dependent payoff instruments
- **Rendering**: path tracing (Pixar, Disney, NVIDIA RTX): each pixel is an integral over the lighting space
- **Bayesian learning**: MCMC and variational inference underpin Bayesian deep learning and probabilistic programming (Stan, PyMC)
Предварительные знания
Monte Carlo Integration
Classical quadrature methods work well in low dimensions but suffer from the curse of dimensionality: for a d-dimensional integral with n points per axis, one need nᵈ evaluations. At d=10 and n=10, that's 10 billion operations. Monte Carlo elegantly solves this problem using randomness.
The integral ∫f(x)dx over domain Ω can be rewritten as an expectation: I = ∫_Ω f(x)dx = vol(Ω) · E[f(X)] where X is uniformly distributed over Ω. The estimate from N random points: Î_N = vol(Ω) · (1/N) · Σᵢ f(xᵢ) The Law of Large Numbers guarantees: Î_N → I as N → ∞.
Key property: **O(1/√N) convergence** is independent of dimension! Unlike quadratures where error is O(h^p) and exponentially more points are needed as d grows. This makes Monte Carlo indispensable for problems with d ≥ 4-5.
**Slow convergence**: to reduce error by 10×, one need 100× more points. For smooth 1D functions, quadratures are much better. **Randomness**: each run gives a slightly different answer. Fix the seed for reproducibility. **High variance**: if f(x) varies strongly, variance-reduction methods are needed (importance sampling, stratification).
How does the Monte Carlo error change when the number of points N increases by a factor of 100?
Quasi-Monte Carlo: Low-Discrepancy Sequences
Random points have an unpleasant property: they sometimes cluster in some regions and leave gaps elsewhere. Quasi-Monte Carlo (QMC) replaces random points with **deterministic low-discrepancy sequences**: they cover the space more uniformly.
**Discrepancy** D_N measures how unevenly the points are distributed. Koksma-Hlawka theorem: |Î_N - I| ≤ V(f) · D_N(x₁,...,x_N) where V(f) is the variation of f. For random points D_N ~ O(1/√N), for Halton and Sobol sequences D_N ~ O((log N)^d / N). For large N and moderate d, this is significantly better.
In practice QMC is used in finance (option pricing), rendering (path tracing in Pixar and Disney), and physical simulation. Scrambled Sobol sequences combine QMC theoretical guarantees with statistical error estimation. The `scipy.stats.qmc` library provides ready-to-use implementations.
What is the fundamental difference between quasi-Monte Carlo and standard Monte Carlo?
Importance Sampling: Variance Reduction
If f(x) is large in some region and small everywhere else, random points from a uniform distribution waste most computations. **Importance sampling** samples more frequently in important regions and corrects the estimate with weights.
The optimal proposal distribution: q*(x) ∝ |f(x)| · p(x). Then the estimator variance is zero-but computing it requires knowing the answer! Practical rules: - q(x) should be close in shape to |f(x)| · p(x) - q(x) must have heavier tails than f(x)·p(x)-otherwise some weights will be infinitely large - Self-normalized IS: use normalized weights w̃ᵢ = wᵢ/Σwⱼ-more stable Bad choice of q → high weight variance → poor estimate. Diagnostic: effective sample size ESS = (Σwᵢ)²/Σwᵢ².
Why must the proposal distribution q(x) in importance sampling have heavier tails than the target p(x)·f(x)?
MCMC: Sampling from Complex Distributions
What if one need to sample from a distribution p(x) known only up to a normalizing constant? In Bayesian statistics, p(θ|data) ∝ p(data|θ)·p(θ), where the normalization is a high-dimensional integral. **Markov Chain Monte Carlo (MCMC)** solves exactly this problem.
**Burn-in**: the first samples depend on the starting point-discard them. Typically 10-25% of total. **Mixing**: the chain must mix-visit all modes of the distribution. Slow mixing → correlated samples → more iterations needed. **Acceptance rate**: the optimal acceptance rate for high d ≈ 23.4% (theoretical result for Gaussian proposals). If higher-steps are too small; if lower-too large. **Gibbs sampling**: a special case of MCMC where each coordinate is sampled in turn from its conditional distribution. Efficient when conditionals p(xᵢ|x₋ᵢ) are known analytically.
Why is the acceptance of a proposed step probabilistic in the Metropolis-Hastings algorithm, rather than always accepted?
Key Ideas
- **MC integration**: estimate E[f(X)] from N random points, error O(1/√N) - independent of dimension
- **Quasi-MC**: deterministic sequences (Halton, Sobol) with discrepancy O((log N)^d/N) - faster for smooth f at moderate d
- **Importance sampling**: sample from q≈|f|·p, correct with weights w=p/q - variance reduction for rare events and tails
- **MCMC (Metropolis-Hastings, Gibbs)**: random walk accepting steps with probability p(x')/p(x) - converges to target without normalization
Related Topics
Monte Carlo bridges numerical methods and probabilistic programming:
- Numerical Integration — MC complements classical quadratures: better in high dimensions, slower in 1D
- Optimization Methods — Simulated annealing and evolutionary algorithms use MC ideas for stochastic optimization
- Numerical Linear Algebra in ML — Randomized linear algebra methods: the next step: Monte Carlo for matrix computations
Вопросы для размышления
- In which problems does Monte Carlo outperform deterministic numerical methods, and in which does it fall short?
- If importance sampling is so effective, why not always use it instead of plain MC?
- How would one diagnose that an MCMC chain has not yet mixed (poor mixing)?