Measure Theory
Product Measures and Fubini's Theorem
How is an expectation over a joint distribution computed? Why does marginalization in Bayesian models work the way it does? Fubini's theorem is the mathematical justification for integrating one variable at a time. Monte Carlo integration is its numerical counterpart.
- **Joint distributions:** independence of random variables means the joint distribution is a product measure; P_{(X,Y)} = P_X × P_Y
- **Bayesian marginalization:** p(y|x) = ∫ p(y|θ,x) p(θ) dθ is an iterated integral over a product measure, justified by Fubini's theorem
- **Monte Carlo integration:** the numerical counterpart of integrating against a product measure; for square-integrable f the root-mean-square error is O(1/√N), a rate that does not depend on dimension
Prerequisites
Product Sigma-Algebras and Measures
How do we rigorously define integration over multiple variables? The answer is to build a measure on the product of spaces. This is precisely how joint distributions in probability theory are formalized.
**Product sigma-algebra:** for measurable spaces (X, F) and (Y, G), the **product sigma-algebra** F ⊗ G is the smallest sigma-algebra on X×Y containing all 'rectangles' A×B, where A ∈ F and B ∈ G.
**Product measure:** for sigma-finite measures μ on (X,F) and ν on (Y,G), there is a unique measure μ×ν on (X×Y, F⊗G) such that (μ×ν)(A×B) = μ(A) · ν(B) for all A ∈ F and B ∈ G.
**Independence = product measure:** random variables X and Y are independent if and only if their joint distribution P_{(X,Y)} equals the product measure P_X × P_Y. This is the fundamental definition, requiring no assumption about the existence of densities.
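On finite spaces the product measure can be written out explicitly, which makes the rectangle identity easy to check. The following is a minimal sketch using only the Python standard library; the coin and three-sided die, and the helper names `product_measure` and `measure`, are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import product

def product_measure(mu, nu):
    """Return the product measure on X x Y as a dict {(x, y): mass}."""
    return {(x, y): mu[x] * nu[y] for x, y in product(mu, nu)}

def measure(m, event):
    """Measure of a finite set: the sum of the point masses it contains."""
    return sum(m[p] for p in event)

# Hypothetical example: a biased coin (mu) and a fair three-sided die (nu).
mu = {"H": Fraction(2, 3), "T": Fraction(1, 3)}
nu = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
joint = product_measure(mu, nu)

A = {"H"}                       # a measurable set in X
B = {1, 3}                      # a measurable set in Y
rectangle = set(product(A, B))  # the rectangle A x B

lhs = measure(joint, rectangle)        # (mu x nu)(A x B)
rhs = measure(mu, A) * measure(nu, B)  # mu(A) * nu(B)
print(lhs, rhs)  # prints: 4/9 4/9
```

Here `joint` is exactly the joint law of an independent coin flip and die roll; exact rational arithmetic via `Fraction` avoids floating-point noise in the comparison.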
The Borel sigma-algebra on ℝ² equals B(ℝ) ⊗ B(ℝ), the product of the one-dimensional Borel sigma-algebras. Consequently, two-dimensional Lebesgue measure restricted to Borel sets is the product measure λ × λ; the full Lebesgue measure on ℝ² is its completion.
Random variables X and Y are independent. What does this mean in terms of product measures?
Fubini's Theorem and Tonelli's Theorem
Can the order of integration be exchanged in a double integral? For the Riemann integral this was a delicate question. Fubini's and Tonelli's theorems give precise conditions for the Lebesgue integral.
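**Tonelli's theorem (non-negative case):** if μ and ν are sigma-finite and f ≥ 0 is measurable on (X×Y, F⊗G), then ∫_{X×Y} f d(μ×ν) = ∫_X (∫_Y f(x,y) dν(y)) dμ(x) = ∫_Y (∫_X f(x,y) dμ(x)) dν(y). The order of integration can be exchanged with no integrability condition; all three quantities may equal +∞, but they always agree.
**Fubini's theorem:** if μ and ν are sigma-finite and f ∈ L¹(μ×ν), the same equality holds for sign-changing f, with all three integrals finite. In practice, integrability is checked by applying Tonelli's theorem to |f|.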
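The equality of the two iterated integrals can be checked numerically for a concrete non-negative integrand. The sketch below assumes the hypothetical test function f(x,y) = x·e^(−xy) on [0,1]², whose integral is e⁻¹, and approximates each iterated integral with a nested midpoint rule; the function names are illustrative only.

```python
import math

def midpoints(n):
    """Midpoints of n equal subintervals of [0, 1]."""
    return [(k + 0.5) / n for k in range(n)]

def iterated_dy_dx(f, n=200):
    """Integrate over y first (inner sum), then over x (outer sum)."""
    pts = midpoints(n)
    return sum(sum(f(x, y) for y in pts) / n for x in pts) / n

def iterated_dx_dy(f, n=200):
    """Integrate over x first (inner sum), then over y (outer sum)."""
    pts = midpoints(n)
    return sum(sum(f(x, y) for x in pts) / n for y in pts) / n

def f(x, y):
    # Hypothetical non-negative integrand; its exact integral is exp(-1).
    return x * math.exp(-x * y)

exact = math.exp(-1)
order_1 = iterated_dy_dx(f)
order_2 = iterated_dx_dy(f)
print(order_1, order_2, exact)
```

On a finite grid the two orders are just two ways of summing the same finite array, so agreement is automatic; Tonelli's theorem is the statement that this survives the passage to the limit whenever f ≥ 0.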
**Marginalization as iterated integration:** in Bayesian statistics, the marginal likelihood p(y) = ∫ p(y|θ) p(θ) dθ integrates the joint density p(y,θ) = p(y|θ) p(θ) over θ. Because the joint density is non-negative, Tonelli's theorem applies: ∫ (∫ p(y,θ) dθ) dy = ∫ (∫ p(y,θ) dy) dθ = ∫ p(θ) dθ = 1, so p(y) is a properly normalized density and expectations under the joint can be computed in either order.
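A discrete analogue makes the marginalization identity concrete. This sketch assumes a hypothetical model that is not part of the text above: θ takes values on a uniform grid in (0,1) with a uniform prior, and y ∈ {0,1} is a Bernoulli(θ) observation. It sums the joint mass in both orders.

```python
# Hypothetical discrete model: uniform prior on a grid of theta values,
# Bernoulli likelihood for a single binary observation y.
thetas = [(k + 0.5) / 100 for k in range(100)]
prior = {t: 1 / 100 for t in thetas}

def likelihood(y, theta):
    """p(y | theta) for y in {0, 1}."""
    return theta if y == 1 else 1 - theta

def marginal(y):
    """p(y) = sum over theta of p(y | theta) p(theta)."""
    return sum(likelihood(y, t) * prior[t] for t in thetas)

# Order 1: sum out theta first, then sum over y.
total_y_outer = sum(marginal(y) for y in (0, 1))

# Order 2: sum out y first (gives 1 for every theta), then sum over theta.
total_theta_outer = sum(
    prior[t] * sum(likelihood(y, t) for y in (0, 1)) for t in thetas
)

print(marginal(1), total_y_outer, total_theta_outer)
```

Both totals equal 1 up to rounding, which is the discrete counterpart of the statement that the marginal p(y) is properly normalized.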
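**Monte Carlo integration** is the numerical counterpart: for probability measures μ and ν, ∫∫ f d(μ×ν) ≈ (1/N) Σ f(xᵢ, yᵢ) with the pairs (xᵢ,yᵢ) drawn independently from μ×ν. Consistency of the estimate comes from the strong law of large numbers and requires f ∈ L¹(μ×ν); the product structure means each pair can be generated by sampling xᵢ ~ μ and yᵢ ~ ν independently, in either order.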
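As a minimal sketch of this estimator, assume μ and ν are both the uniform distribution on [0,1] and take the hypothetical integrand f(x,y) = x·y², whose exact integral is (1/2)·(1/3) = 1/6. The function name `mc_estimate` and the seed are illustrative choices.

```python
import random

def mc_estimate(f, n, seed=0):
    """Average f over n independent draws from Uniform(0,1) x Uniform(0,1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.random()  # x ~ mu (assumed uniform on [0, 1])
        y = rng.random()  # y ~ nu (assumed uniform on [0, 1]), independent of x
        total += f(x, y)
    return total / n

def f(x, y):
    # Hypothetical integrand with known integral 1/6 over the unit square.
    return x * y * y

estimate = mc_estimate(f, 100_000)
exact = 1 / 6
print(estimate, exact)
```

Drawing `x` and `y` from separate calls is exactly what sampling from the product measure means; swapping the two lines changes the random stream but not the distribution of the pair.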
When Fubini Fails: A Counterexample
What happens when the L¹ condition is violated? The classic counterexample shows two iterated integrals that give different values. This is not a contradiction; it simply means the function is not L¹-integrable over the product space.
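**Fubini counterexample:** define on [0,1]×[0,1] (away from the origin): f(x,y) = (x² − y²) / (x² + y²)². Then:
- ∫₀¹ (∫₀¹ f(x,y) dy) dx = ∫₀¹ 1/(1+x²) dx = π/4
- ∫₀¹ (∫₀¹ f(x,y) dx) dy = −∫₀¹ 1/(1+y²) dy = −π/4
The two iterated integrals give **different values**! The reason: f ∉ L¹([0,1]²), that is, ∫∫ |f| d(λ×λ) = ∞. In polar coordinates |f| = |cos 2θ| / r², so the integral of |f| near the origin behaves like ∫ dr / r, which diverges.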
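The two values can be reproduced numerically. Because f is singular at the origin, this sketch evaluates each inner integral in closed form, using the antiderivatives y/(x²+y²) in y and −x/(x²+y²) in x, and applies a midpoint rule only to the outer integral. The helper names are illustrative.

```python
import math

def inner_dy(x):
    """Integral of f(x, y) over y in [0, 1] for fixed x > 0.

    The antiderivative in y is y / (x^2 + y^2), which is 0 at y = 0.
    """
    return 1.0 / (x * x + 1.0)

def inner_dx(y):
    """Integral of f(x, y) over x in [0, 1] for fixed y > 0.

    The antiderivative in x is -x / (x^2 + y^2), which is 0 at x = 0.
    """
    return -1.0 / (1.0 + y * y)

def midpoint_rule(g, n=10_000):
    """Midpoint-rule approximation of the integral of g over [0, 1]."""
    return sum(g((k + 0.5) / n) for k in range(n)) / n

dy_then_dx = midpoint_rule(inner_dy)  # approximately  pi / 4
dx_then_dy = midpoint_rule(inner_dx)  # approximately -pi / 4
print(dy_then_dx, dx_then_dy, math.pi / 4)
```

The antisymmetry f(x,y) = −f(y,x) is what forces the two orders to differ by a sign.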
**Practical lesson for ML:** when computing E_{(x,y)~P}[f(x,y)] via iterated integrals, first over x, then over y, verify that E[|f(X,Y)|] < ∞. If violated (e.g., heavy-tailed distributions), different integration orders can give different answers.
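One way to see a failed integrability check is to watch a truncated absolute moment grow without bound. The sketch below assumes a standard Cauchy distribution as the heavy-tailed example (an illustrative choice, not taken from the text above): the integral of |x|·p(x) over [−T, T] equals ln(1+T²)/π, which diverges as T → ∞.

```python
import math

def cauchy_density(x):
    """Density of the standard Cauchy distribution."""
    return 1.0 / (math.pi * (1.0 + x * x))

def truncated_abs_moment(T, n=20_000):
    """Midpoint-rule approximation of the integral of |x| p(x) over [-T, T]."""
    h = T / n
    half = sum(
        (k + 0.5) * h * cauchy_density((k + 0.5) * h) for k in range(n)
    ) * h
    return 2.0 * half  # the integrand is even, so double the [0, T] part

def closed_form(T):
    """Exact value of the same truncated integral: ln(1 + T^2) / pi."""
    return math.log(1.0 + T * T) / math.pi

for T in (10.0, 100.0, 1000.0):
    print(T, truncated_abs_moment(T), closed_form(T))
```

The truncated moment keeps increasing by a roughly constant amount each time T grows tenfold, so E[|X|] = ∞ and neither Fubini's theorem nor the law of large numbers applies to f(X) = X.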
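In deep learning, a related pitfall arises when computing gradients of expected losses with heavy tails. Exchanging expectation and differentiation, ∇_θ E[L(θ,X)] = E[∇_θ L(θ,X)], is justified by dominated convergence rather than by Fubini: E[|L(θ,X)|] < ∞ is needed for the expectation to exist, and in addition ‖∇_θ L(θ,X)‖ must be bounded by an integrable function of X for all θ in a neighborhood of the point of interest.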
If ∫₀¹(∫₀¹ f dy)dx ≠ ∫₀¹(∫₀¹ f dx)dy, what does this imply?
Monte Carlo as Numerical Fubini
Monte Carlo integration is the numerical realization of Fubini's theorem. An integral over a product measure is estimated as a sample average. Measure theory explains why this works and what its accuracy is.
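**Monte Carlo for a double integral:** for probability measures μ and ν and f ∈ L¹(μ×ν), the strong law of large numbers gives E_{(x,y)~μ×ν}[f(x,y)] = ∫∫ f d(μ×ν) ≈ (1/N) Σᵢ f(xᵢ, yᵢ), where the (xᵢ, yᵢ) ~ μ×ν are independent samples from the product measure; Fubini's theorem identifies this expectation with the iterated integral. If additionally f ∈ L²(μ×ν), the root-mean-square error is σ/√N with σ² = Var f(x,y): the O(1/√N) rate is independent of dimension, although σ itself can grow with dimension.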
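The 1/√N rate can be observed empirically: quadrupling N should roughly halve the root-mean-square error. This is a rough sketch under the same assumptions as before (uniform μ and ν on [0,1], hypothetical integrand f(x,y) = x·y² with exact integral 1/6); the replication count and seed are arbitrary.

```python
import math
import random

def mc_estimate(f, n, rng):
    """Average f over n independent draws from Uniform(0,1) x Uniform(0,1)."""
    return sum(f(rng.random(), rng.random()) for _ in range(n)) / n

def rmse(f, exact, n, reps, rng):
    """Root-mean-square error of the estimator over `reps` independent runs."""
    sq = 0.0
    for _ in range(reps):
        err = mc_estimate(f, n, rng) - exact
        sq += err * err
    return math.sqrt(sq / reps)

def f(x, y):
    # Hypothetical integrand with known integral 1/6 over the unit square.
    return x * y * y

rng = random.Random(1)
rmse_n = rmse(f, 1 / 6, 500, 200, rng)    # error with N = 500
rmse_4n = rmse(f, 1 / 6, 2000, 200, rng)  # error with N = 2000
ratio = rmse_n / rmse_4n                  # should be close to sqrt(4) = 2
print(rmse_n, rmse_4n, ratio)
```

The ratio is itself a noisy estimate, so it will not be exactly 2; more replications tighten it.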
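**Marginalization in probabilistic ML:** the posterior predictive p(y*|x*,X,y) = ∫ p(y*|x*,θ) p(θ|X,y) dθ is usually approximated with MCMC or variational methods. The integrand is the joint density p(y*,θ|x*,X,y) = p(y*|x*,θ) p(θ|X,y); because it is non-negative, Tonelli's theorem guarantees that integrating out θ yields a properly normalized density: ∫ p(y*,θ|x*,X,y) dθ = p(y*|x*,X,y).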
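When posterior samples are available, the predictive integral becomes a sample average. The sketch below assumes a deliberately simple conjugate model that is not part of the text above: a Bernoulli likelihood with no inputs x, a Beta(1,1) prior, and hypothetical data of 7 successes and 3 failures, so the posterior is Beta(8,4) and the exact predictive probability is 8/12.

```python
import random

def posterior_predictive_mc(a, b, n, seed=0):
    """Estimate p(y* = 1 | data) by averaging p(y* = 1 | theta) = theta
    over n draws theta_i ~ Beta(a, b), the assumed posterior."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        theta = rng.betavariate(a, b)  # theta_i ~ p(theta | data)
        total += theta                 # p(y* = 1 | theta_i) = theta_i
    return total / n

# Hypothetical data: Beta(1, 1) prior, 7 successes, 3 failures.
a, b = 1 + 7, 1 + 3
estimate = posterior_predictive_mc(a, b, 50_000)
exact = a / (a + b)  # closed-form posterior mean of Beta(a, b)
print(estimate, exact)
```

In a realistic model the draws would come from an MCMC chain or a variational approximation rather than from `betavariate`, and the closed-form check would not be available.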
Quasi-Monte Carlo (QMC) replaces random points with low-discrepancy sequences (Sobol, Halton). For integrands of bounded variation in the sense of Hardy and Krause, the Koksma-Hlawka inequality gives an integration error of O((log N)^d / N) rather than O(1/√N), a significant gain in low to moderate dimensions d; the (log N)^d factor erodes the advantage as d grows.
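A two-dimensional Halton sequence is short enough to write by hand. The following sketch builds it from the radical-inverse function in bases 2 and 3 and applies it to the same hypothetical integrand f(x,y) = x·y² (exact integral 1/6); the function names and the choice N = 4096 are illustrative.

```python
def radical_inverse(i, base):
    """Reflect the base-`base` digits of the integer i about the radix point."""
    inv = 1.0 / base
    weight = inv
    result = 0.0
    while i > 0:
        i, digit = divmod(i, base)
        result += digit * weight
        weight *= inv
    return result

def halton_2d(n):
    """First n points of the 2D Halton sequence (bases 2 and 3), from index 1."""
    return [(radical_inverse(i, 2), radical_inverse(i, 3)) for i in range(1, n + 1)]

def qmc_estimate(f, n):
    """Average f over the first n Halton points in the unit square."""
    return sum(f(x, y) for x, y in halton_2d(n)) / n

def f(x, y):
    # Hypothetical integrand with known integral 1/6 over the unit square.
    return x * y * y

estimate = qmc_estimate(f, 4096)
print(estimate, 1 / 6)
```

The points are deterministic, so repeated runs give the same estimate; error estimation for QMC therefore typically relies on randomized variants such as scrambling, which this sketch does not implement.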
Key Ideas
- **Product measure μ×ν** (for sigma-finite μ, ν) is the unique measure on (X×Y, F⊗G) with (μ×ν)(A×B) = μ(A)·ν(B); independence means joint = product
- **Tonelli:** for measurable f ≥ 0 and sigma-finite measures, the order of integration is freely interchangeable; **Fubini:** for f ∈ L¹(μ×ν), the same holds for sign-changing f
- **Counterexample:** f(x,y) = (x²−y²)/(x²+y²)² gives different iterated integrals because f ∉ L¹
- **Monte Carlo:** E_{μ×ν}[f] ≈ (1/N) Σ f(xᵢ,yᵢ) with O(1/√N) root-mean-square error for f ∈ L²; the rate is dimension-free
Related Topics
Product measures connect measure theory to probability and computation:
- Duality and Riesz: integral representations of functionals φ(f) = ∫fg dμ use the product measure structure
- Abstract Measure Theory: marginals are pushforwards of the joint measure under coordinate projections, and conditional expectations are defined through Radon-Nikodym derivatives
- Measure Theory and Probability: joint distributions and conditional expectations are built from product measures and Fubini
Questions for Reflection
- Why doesn't Tonelli's theorem require an L¹ condition? When a non-negative function has an infinite integral, can the iterated integrals still differ?
- Monte Carlo requires f ∈ L¹ in theory. How does this constraint show up in practice when computing expectations with heavy-tailed distributions?
- In variational autoencoders, the ELBO is E_{q(z)}[log p(x|z)] minus KL(q(z)||p(z)). Where does the product measure structure appear in this formula?