Measure Theory

Lp Spaces

Why do MSE and MAE behave so differently? Why does L2 regularization (Ridge) shrink coefficients but never zero them out, while L1 (Lasso) produces exact zeros? The answer lies in the geometry of Lp spaces: the shape of the unit ball completely determines the behavior of the regularizer.

  • **Regularization in ML:** the L¹ penalty (Lasso) induces sparsity because the L¹ unit ball has corners on the coordinate axes; the L² penalty (Ridge) does not zero out coefficients because the L² ball is smooth
  • **Signal processing:** L² is the space of finite-energy signals (Parseval's theorem); L¹ captures absolutely summable Fourier series (functions with no Gibbs phenomenon)
  • **Neural network theory:** hypothesis classes for neural networks are described by Lp-type norms that quantify the 'complexity' of the function class

Предварительные знания

  • Convergence Theorems

The Lp Norm and Family of Spaces

L¹ is merely the first member of a rich family. For each p ∈ [1, ∞] there is a separate Lp space with a norm that measures the 'p-th power size' of a function. This family encompasses nearly every norm encountered in machine learning.

**Lp norm (1 ≤ p < ∞):** ‖f‖_p = (∫ |f|^p dμ)^{1/p} **L∞ norm (essential supremum):** ‖f‖_∞ = ess sup |f(x)| = inf { M : μ({|f| > M}) = 0 } **Space Lp(μ):** equivalence classes of measurable functions with ‖f‖_p < ∞, where f ~ g if f = g μ-a.e.

pNormUnit ball shape (R²)ML use
1‖f‖₁ = ∫|f| dμDiamond (rotated square)MAE, Lasso, sparsity
2‖f‖₂ = (∫f² dμ)^{1/2}Euclidean ball (circle)MSE, Ridge, neural nets
∞‖f‖_∞ = ess sup|f|Cube (Chebyshev)Robust optimization
p > 2‖f‖_pSuperellipseTheoretical Lp analysis
0 < p < 1‖f‖_p (not a norm!)Non-convexCompressed sensing (non-convex)

Special case: **L∞** consists of essentially bounded functions. Note that ‖f‖_∞ = lim_{p→∞} ‖f‖_p for functions with finite-measure support. The space L² is the most convenient: it is a Hilbert space (with inner product ⟨f,g⟩ = ∫fg dμ), while Lp for p ≠ 2 is only Banach.

Suppose f ∈ L²([0,1]). Does it follow that f ∈ L¹([0,1])?

Holder and Minkowski Inequalities

Two fundamental inequalities of Lp theory: **Holder's inequality** bounds the integral of a product, and **Minkowski's inequality** proves that ‖·‖_p is a genuine norm (satisfying the triangle inequality).

**Holder's inequality:** if 1/p + 1/q = 1 (p and q are called **conjugate exponents**), then for f ∈ Lp and g ∈ Lq: ∫ |f · g| dμ ≤ ‖f‖_p · ‖g‖_q Special case p = q = 2: the **Cauchy-Schwarz inequality** ∫|fg| dμ ≤ ‖f‖₂‖g‖₂. **Minkowski's inequality:** for f, g ∈ Lp: ‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p This is the triangle inequality - what makes ‖·‖_p a norm.

**Application in ML:** Holder's inequality underpins regularization bounds. For p=q=2: |w·x| ≤ ‖w‖₂·‖x‖₂. This means the output of a linear layer is bounded by the product of weight and input norms - the standard argument for generalization bounds via norm-bounded hypothesis classes.

Conjugate pairs (p, q) with 1/p+1/q=1: (1,∞), (2,2), (3,3/2), (4,4/3). The pair (1,∞) connects L¹-loss (Lasso) with L∞-bounded perturbations (adversarial robustness). The pair (2,2) is the Euclidean inner product geometry of neural networks.

For p=4, what is the conjugate exponent q?

Completeness: The Riesz-Fischer Theorem

A space is **complete** if every Cauchy sequence converges to an element of the same space. Completeness is critical: without it there is no guarantee that optimization algorithms converge to a solution in the same function class.

**Riesz-Fischer Theorem:** for any 1 ≤ p ≤ ∞, the space Lp(μ) is **complete**. That is: if ‖fₙ − fₘ‖_p → 0 as n,m → ∞ (Cauchy sequence), then there exists f ∈ Lp(μ) with ‖fₙ − f‖_p → 0. Consequence: Lp(μ) is a **Banach space**. For p=2 it is a **Hilbert space** (complete with inner product ⟨f,g⟩ = ∫fg dμ).

Completeness of L² guarantees that **iterative optimization** (gradient descent, conjugate gradients) in L² spaces converges to an element of the same space. Without completeness, the 'limit' could fall outside the class, rendering the algorithm meaningless.

**A subtle point:** an element of Lp is not a single function but an equivalence class of functions that agree almost everywhere. Saying 'f(x) = 0' is imprecise; the correct statement is 'f = 0 μ-a.e.' or '‖f‖_p = 0'.

Why is L²([0,1]) preferred over C([0,1]) (continuous functions with the same L² norm)?

Lp Inclusions and Norm Geometry in ML

How do different Lp spaces relate to each other? On finite measure spaces (such as [0,1]) the spaces nest by decreasing index: the larger p, the stronger the integrability requirement.

**Lp inclusions on finite measure spaces:** if μ(X) < ∞ and 1 ≤ p ≤ q ≤ ∞, then: L^q(μ) ⊂ L^p(μ) and ‖f‖_p ≤ μ(X)^{1/p − 1/q} · ‖f‖_q In particular: L^∞ ⊂ ... ⊂ L² ⊂ L¹. **On infinite measure spaces (e.g., Lebesgue measure on ℝ) there are no inclusions:** one can find functions in L¹\L² and in L²\L¹.

**Norm geometry in ML:** the unit balls in ℝ² for different p have strikingly different shapes. The L¹ ball (a diamond) has corners on the coordinate axes - optimization with L¹ regularization hits these corners, producing sparsity. The L² ball (a circle) is smooth, explaining why Ridge regression does not zero out coefficients.

Deep learning predominantly uses the L² norm (MSE loss, weight decay). However, L¹ is critical for model compression and quantization: it identifies and zeros out negligible weights, reducing model size without quality loss. The elastic net combines L¹ and L² to get both sparsity and stability.

On [0,1] with Lebesgue measure, if f ∈ L³, then:

Key Ideas

  • **Lp norm:** ‖f‖_p = (∫|f|^p dμ)^{1/p} for p∈[1,∞); ‖f‖_∞ = ess sup |f|; p=2 gives a Hilbert space
  • **Holder:** ∫|fg| dμ ≤ ‖f‖_p · ‖g‖_q for conjugate 1/p+1/q=1; generalizes Cauchy-Schwarz
  • **Minkowski:** ‖f+g‖_p ≤ ‖f‖_p + ‖g‖_p (triangle inequality); **Riesz-Fischer:** Lp is complete for all 1≤p≤∞
  • **Inclusions:** on finite measure spaces L∞ ⊂ ... ⊂ L² ⊂ L¹; on ℝ no inclusions hold; the choice of p depends on the application

Related Topics

Lp spaces are the central objects of functional analysis and ML theory:

  • Convergence Theorems — DCT proves that Lp convergence is compatible with the Lebesgue integral
  • Duality and the Riesz Representation Theorem — The dual of Lp is Lq for conjugate p,q; this is the Riesz representation theorem
  • Measure Theory and Probability — Random variables with finite p-th moment are exactly the elements of Lp(Ω,F,P)

Вопросы для размышления

  • Why is L² uniquely convenient compared to other Lp spaces? What specific properties (inner product, orthogonal projections, Fourier series) make it privileged?
  • In a regression problem, when is L¹ loss (MAE) preferable over L² loss (MSE), and vice versa? How does Lp geometry relate to robustness against outliers?
  • Why are there no inclusions between L¹ and L² on ℝ? Construct a function in L¹(ℝ)\L²(ℝ) and one in L²(ℝ)\L¹(ℝ).

Связанные уроки

  • top-04
Lp Spaces

0

1

Sign In