Measure Theory
Convergence Theorems
Policy gradient theorem: $\nabla J(\theta) = \mathbb{E}[\nabla \log \pi_\theta(a|s) \cdot Q(s,a)]$. Can $\nabla_\theta$ be moved inside the expectation? Only when the conditions of Lebesgue's dominated convergence theorem hold. All of RL theory - REINFORCE, PPO, SAC - rests on this exchange. The three theorems of this lesson - MCT, DCT, Fatou - give exact conditions for when it is legal.
- **Policy gradient (REINFORCE):** $\nabla_\theta \mathbb{E}_{\pi_\theta}[R] = \mathbb{E}_{\pi_\theta}[R \cdot \nabla \log \pi_\theta]$ - differentiation under the expectation sign, justified by DCT when a dominator for $\nabla \log \pi_\theta$ exists
- **EM algorithm and VAE:** ELBO increases monotonically at each step - MCT structure. The theorem guarantees convergence to a local maximum without additional conditions
- **Strong law of large numbers:** $(1/n) \sum X_i \to \mathbb{E}[X]$ almost surely is proved through MCT and Fatou's lemma applied to random variables
- **Regularization as UI:** L2 regularization, gradient clipping, batch norm enforce uniform integrability, converting theoretical L1 convergence into algorithmic convergence
Предварительные знания
Monotone Convergence Theorem (MCT)
Limits and integrals cannot simply be exchanged. The trap: $f_n(x) = n \cdot \mathbf{1}_{[0, 1/n]}$. Each function is integrable. $\lim f_n = 0$ almost everywhere. But $\int f_n = 1$ for every $n$. Infinity at a single shrinking point kills convergence. This is precisely why conditions are needed - and the three theorems of this lesson give an exhaustive answer.
MCT (Beppo Levi) covers the cleanest case: non-negative functions that increase monotonically. No subtleties about escaping mass.
**Monotone Convergence Theorem (MCT):** if $f_n \geq 0$ are measurable and $f_n \uparrow f$ (increase monotonically almost everywhere), then: $$\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$$ Equality, not an inequality. Monotonicity prevents mass from escaping. The integral of the limit equals the limit of the integrals.
The EM algorithm runs on an MCT-like structure: the ELBO lower bound increases monotonically at each E+M step. MCT's guarantee: a monotonically increasing sequence that is bounded above must converge. This is why EM converges to a local maximum of the likelihood without additional conditions.
| Theorem | Condition | Conclusion | ML application |
|---|---|---|---|
| MCT (Beppo Levi) | $f_n \geq 0$, monotone increasing | $\lim \int f_n = \int f$ (equality) | EM algorithm, ELBO convergence |
| DCT (Lebesgue) | $|f_n| \leq g \in L^1$, $f_n \to f$ a.e. | $\lim \int f_n = \int f$ (equality) | Policy gradient, differentiation under integral |
| Fatou's lemma | $f_n \geq 0$ | $\int \liminf f_n \leq \liminf \int f_n$ | SGD convergence proofs |
If $f_n \to f$ almost everywhere, then $\int f_n \to \int f$ automatically
Almost everywhere convergence does not imply convergence of integrals without additional conditions (monotonicity or domination).
The trap: $f_n = n \cdot \mathbf{1}_{[0,1/n]}$. Converges to 0 a.e., but $\int f_n = 1$ for every $n$. Mass concentrates at a point that shrinks to zero.
The sequence $f_n = \mathbf{1}_{[0,1/n]}$ on $[0,1]$ converges to $f \equiv 0$. Does $\lim \int f_n = 0$?
The Dominated Convergence Theorem (DCT)
Policy gradient theorem: $\nabla J(\theta) = \mathbb{E}_{\pi_\theta}[\nabla \log \pi_\theta(a|s) \cdot Q(s,a)]$. Can $\nabla_\theta$ be moved inside the expectation? Only when the conditions of Lebesgue's dominated convergence theorem hold. All of RL theory - REINFORCE, PPO, SAC - rests on this exchange, invoking DCT implicitly every time.
**Lebesgue's Dominated Convergence Theorem (DCT):** let $f_n \to f$ almost everywhere, and suppose there exists $g \in L^1(\mu)$ with $|f_n| \leq g$ a.e. for all $n$. Then: 1. $f \in L^1(\mu)$ 2. $\lim_{n \to \infty} \int f_n \, d\mu = \int f \, d\mu$ 3. $\lim_{n \to \infty} \|f_n - f\|_1 = 0$ $g$ is called the **dominating function**. The condition $g \in L^1$ is essential - without it the theorem fails.
**Differentiation under the integral sign** (Leibniz rule): if $F(\theta) = \int f(x, \theta) \, d\mu(x)$ and $|\partial f / \partial \theta| \leq g(x) \in L^1$, then $F'(\theta) = \int \partial f / \partial \theta \, d\mu(x)$. Proof: apply DCT to difference quotients $(f(x, \theta+h) - f(x, \theta))/h$ as $h \to 0$.
**The trap without a dominator:** $f_n(x) = n \cdot \mathbf{1}_{[0,1/n]}$. Converges to 0 almost everywhere. But $\sup_n f_n(x) = +\infty$ at $x = 0$ - no dominator in $L^1$ exists. $\int f_n = 1$ for all $n$: the limit of integrals is not zero. DCT without a dominator fails - always verify.
The Leibniz rule is always valid for smooth functions
An integrable dominator for the partial derivative is required. Smoothness of the function does not guarantee integrability of the partial derivative.
Policy gradient is valid because the policy is usually parameterized so that $|\nabla_\theta \log \pi_\theta|$ has an integrable dominator. In pathological cases (unbounded policy gradients) the algorithm can diverge.
$f_n(x) = n \cdot \mathbf{1}_{[0,1/n]}(x) \to 0$ almost everywhere on $[0,1]$. Does $\lim \int f_n \, d\lambda = 0$?
Fatou's Lemma
Fatou's lemma is the weakest of the three results - only an inequality, only for non-negative functions. Yet both MCT and DCT are proved using it. It is the ground floor on which the entire limit theory stands.
**Fatou's Lemma:** if $f_n \geq 0$ are measurable, then: $$\int \liminf_{n \to \infty} f_n \, d\mu \leq \liminf_{n \to \infty} \int f_n \, d\mu$$ Intuition: mass can escape to infinity or concentrate at a shrinking point, making the integral of the limit strictly less than the limit of the integrals.
**Application in SGD convergence proofs.** A typical argument: write the loss as an integral, apply Fatou's lemma to get a lower bound on $\liminf$ of the loss, then show $\liminf = 0$. For example, in the Robbins-Monro theorem: $\liminf_n \|\nabla L(\theta_n)\|^2 = 0$ almost surely - proved precisely through Fatou.
**ELBO and Fatou's lemma:** the Evidence Lower BOund in VAE is $\mathbb{E}_q[\log p(x|z)] - KL(q||p)$. When optimizing over $q$ (the E-step) ELBO increases monotonically. Fatou's lemma ensures that taking the infimum over $q$ does not overestimate the integral: $\int \liminf_q$ is well-defined.
Fatou's lemma is less important than DCT because it only gives an inequality
Fatou's lemma is the foundation: both MCT and DCT are proved through it. An inequality in the right direction provides a lower bound sufficient for most applications.
MCT is proved as a consequence of Fatou for monotone sequences. DCT is built from Fatou plus the reverse Fatou. The ground floor carries the whole building.
For non-negative measurable $f_n$, Fatou's lemma states:
Uniform Integrability and Vitali's Theorem
When does $\mathbb{E}[X_n] \to \mathbb{E}[X]$? DCT gives a sufficient condition (a dominator). Uniform integrability (UI) is exactly the necessary and sufficient condition under convergence in measure. Regularization in ML (L1/L2/weight decay) enforces UI for the family of loss functions.
**Uniform integrability (UI):** the family $\{f_n\}$ is uniformly integrable if: $$\lim_{M \to \infty} \sup_n \int_{|f_n| > M} |f_n| \, d\mu = 0$$ Intuition: the tails of all $f_n$ are simultaneously small above a sufficiently large threshold $M$. The functions cannot all escape to infinity at the same time. **Vitali's convergence theorem:** $f_n \xrightarrow{L^1} f$ if and only if $f_n \to f$ in measure and $\{f_n\}$ is uniformly integrable.
**Why this matters for ML.** L1 convergence of neural network weights requires UI for the family of gradients. Without regularization, gradients may fail to be UI - hence gradient explosion. Weight decay (L2), gradient clipping, batch normalization all enforce uniform integrability among other things, converting theoretical convergence into algorithmic convergence.
Gradient clipping and L2 regularization are purely practical stabilization tricks
These techniques enforce uniform integrability of the gradients, which theoretically guarantees L1 convergence and the validity of limit exchanges.
Without UI one cannot guarantee $\mathbb{E}[\nabla L_n] \to \mathbb{E}[\nabla L]$. Vitali's theorem: L1 convergence is equivalent to convergence in measure plus UI. Regularization enforces UI - a theoretical justification for a practical result.
For the Leibniz rule $\frac{d}{d\theta} \int f(x,\theta) \, d\mu = \int \frac{\partial f}{\partial \theta} \, d\mu$ to hold, what is required?
Key Ideas
- **Limit and integral cannot simply be exchanged:** $f_n = n \cdot \mathbf{1}_{[0,1/n]} \to 0$ a.e., but $\int f_n = 1$ - the canonical trap
- **MCT:** monotone increasing + non-negative gives equality $\lim \int f_n = \int f$. Foundation of the EM algorithm
- **DCT:** dominator $g \in L^1$ + a.e. convergence gives equality. Foundation of policy gradient and the Leibniz rule
- **Fatou's lemma:** only $f_n \geq 0$, only inequality $\int \liminf f_n \leq \liminf \int f_n$. Ground floor of all limit theory - MCT and DCT are proved through it
- **Uniform integrability:** necessary and sufficient for L1 convergence. Regularization in ML enforces UI
Related Topics
Convergence theorems connect the Lebesgue integral to probability and ML optimization:
- The Lebesgue Integral — MCT and DCT are the main tools for working with the Lebesgue integral
- Lp Spaces — L1 convergence is a special case of Lp; DCT proves completeness of L1
- Measure Theory and Probability — Strong LLN and CLT rely on the limit theorems from this lesson
- Measure Theory in ML — Policy gradient, ELBO, EM - direct applications of DCT and MCT in ML
Вопросы для размышления
- In the trap $f_n = n \cdot \mathbf{1}_{[0,1/n]}$ the mass does not escape to infinity - it concentrates at a shrinking point. What does this say about the nature of 'escaping mass' in Fatou's lemma?
- In what real ML scenarios might the DCT condition (an integrable dominator) fail - and what does that mean for the algorithm?
- Fatou's lemma gives an inequality rather than equality. What does a strict inequality mean in probabilistic terms - what exactly is being lost in the limit?
Связанные уроки
- mt-04 — DCT is a tool for working with the Lebesgue integral - its construction is needed
- mt-06 — L1 convergence is a special case of Lp; DCT proves completeness of L1
- mt-10 — The strong law of large numbers and CLT are proved via MCT and Fatou's lemma
- calc-01-sequences — Limit theorems for functions generalize convergence of numerical sequences
- mt-11 — Policy gradient, ELBO, EM algorithm - direct applications of DCT in ML
- stat-02-estimation