Statistics
High-Dimensional Statistics and Sparsity
How can parameters be estimated in models with thousands of features and only hundreds of observations, when classical OLS is degenerate?
- **UK Biobank genomics:** p ~ 800,000 SNP markers with n ~ 500,000 participants; LASSO selects a sparse set of disease-associated variants for downstream validation
- **Neuroimaging:** fMRI data with p > 10^5 voxels; sparse regression to find active brain regions
- **Finance:** portfolio selection from 3,000 stocks with 252 trading days per year; elastic net for stable weight estimation
- **Compressed sensing:** MRI scanner reconstructs images from 20% of measurements via L1 minimization
Предварительные знания
- Linear regression and OLS
- Vector norms (L1, L2)
- Convex optimization
When p > n, the matrix X^TX is singular: OLS has no unique solution. Regularization adds a complexity penalty. The L1 penalty (LASSO) induces sparsity: many coefficients are driven to exactly zero, performing automatic variable selection alongside estimation.
Group LASSO extends LASSO to structured sparsity: if features are partitioned into groups G₁,...,G_k, the penalty ∑_g λ·‖β_{G_g}‖₂ selects entire groups rather than individual features. Applied in genomics (SNP groups within one gene), multitask learning (features of one type), and text analysis (n-grams of one word).
Debiased LASSO provides valid inference in high dimensions: β̂_debiased = β̂_LASSO + (1/n)·Θ̂·X^T(y - Xβ̂_LASSO), where Θ̂ is an approximate inverse of (1/n)X^TX. The debiased estimator is asymptotically normal, enabling confidence intervals for individual components of β even when p >> n. Implemented in the HDI package in R.
Compressed Sensing (Candès, Romberg, Tao, 2006): if a signal is s-sparse in basis Psi, then m = O(s·log(n/s)) measurements via random matrix Phi are sufficient for exact recovery via L1 minimization: min ‖z‖₁ subject to Phi·Psi·z = y. The matrix Phi·Psi satisfies RIP with high probability when m = O(s·log(n/s)). This is the theoretical foundation of next-generation MRI scanners operating at 5x lower acquisition time.
Practical LASSO guidelines: λ is selected via cross-validation (cv.glmnet in R) or the rule λ = σ√(2·log(p)/n). SCAD and MCP are nonconvex penalties that produce unbiased estimates of nonzero coefficients at the cost of nonconvex optimization.
The knockoff filter (Barber & Candes, 2015) controls FDR in variable selection via LASSO without distributional assumptions. Knockoff variables X̃_j are constructed as model-X knockoffs uncorrelated with Y given X. The statistic W_j = |Z_j| - |Z̃_j| (difference of LASSO coefficients) is used for FDR control: variables with W_j above threshold T are selected with guaranteed FDR <= q. The model-X version requires only knowledge of the covariate distribution, not the outcome model.
Matrix completion (Candes & Recht, 2009) solves the Netflix recommendation problem: recover a low-rank matrix M from a subset of observed entries. The nuclear norm ‖M‖_* = ∑_i σ_i (sum of singular values) is the convex relaxation of the rank. Nuclear norm minimization succeeds when the observed entries are drawn uniformly at random and n >= C·r·max(m,n)·log(max(m,n)), where r is the true rank. LASSO for matrices.
LASSO and L1 regularization
LASSO (Tibshirani, 1996) is linear regression with an L1 penalty on the absolute value of coefficients. Unlike ridge regression (L2 penalty), the L1 norm has 'sharp corners' at zero, so the solution exactly zeros out some coefficients: LASSO simultaneously estimates and selects features. Critical in the p >> n regime where OLS is undefined.
Elastic Net = α·L1 + (1-α)·L2 combines LASSO with ridge: it selects whole groups of correlated features (LASSO picks one of the group at random).
Why does LASSO produce exact zeros while ridge regression only shrinks coefficients towards zero?
Geometrically the loss contour is an ellipsoid; the constraint ||β||_1 ≤ t is a diamond with vertices on the axes; ||β||_2 ≤ t is a ball. The ellipsoid more often touches the diamond at a vertex (where some coordinates are exactly zero) than the ball (whose surface is smooth). The KKT condition confirms it: when |XⱼᵀRes/n| < λ, the optimum sits at β_j = 0.
restricted isometry property RIP
RIP (Restricted Isometry Property, Candès-Tao) is a property of the design matrix X that guarantees stable recovery of a sparse β from y = Xβ + ε. A matrix X ∈ ℝ^{n×p} satisfies RIP of order s with constant δ_s if, for all s-sparse vectors x: (1-δ_s)||x||² ≤ ||Xx||²/n ≤ (1+δ_s)||x||².
What does the RIP condition guarantee for a design matrix X in high-dimensional regression?
RIP is not a property of the whole matrix but of every one of its s-column submatrices. The condition (1-δ_s)||x||² ≤ ||Xx||²/n ≤ (1+δ_s)||x||² means singular values of every s-submatrix lie in [√(1-δ_s), √(1+δ_s)]. This guarantees that LASSO/Basis Pursuit/OMP separate true nonzeros from noise.
oracle inequalities
An oracle inequality is an upper bound on estimator error that compares it to a hypothetical 'oracle' that knows the true support S = {j : β*_j ≠ 0}. Intuition: a real estimator should not lose much to the oracle, even without knowing S. For LASSO such bounds were proven by Bickel-Ritov-Tsybakov (2009) and Candès-Tao.
Minimax optimality: the rate σ²·s·log(p)/n cannot be improved by any estimator, it is a lower bound for s-sparse linear regression in p dimensions.
What does the oracle inequality for LASSO show?
The oracle is a hypothetical estimator that knows S = supp(β*) and runs OLS on those |S| features with rate σ²|S|/n. Without knowing S, LASSO pays a logarithmic price log p for searching for the sparse solution but attains nearly the oracle rate under RIP / irrepresentable condition. This rate is minimax optimal.
High-dimensional statistics and related areas
LASSO connects sparse statistics, compressed sensing, and information theory.
- Compressed sensing — RIP and L1 minimization give identical recovery guarantees for sparse signals
- Bayesian interpretation — LASSO equals the MAP estimate under a Laplace prior on the coefficients
- Supervised learning — Regularized regression is the baseline feature-selection tool for high-dimensional ML
Итоги
- LASSO minimizes RSS + λ‖β‖₁; L1 penalty creates exact zeros, performing variable selection
- Oracle λ ~ √(log p / n) balances bias and variance; LASSO error rate ~ s*·log(p)/n
- RIP: X/√n is nearly isometric on s-sparse vectors; random matrices satisfy RIP at n >= Cs·log(p/s)
- Elastic Net = L1 + L2 penalties: sparsity + stability under highly correlated features
- Sign consistency: LASSO recovers S* when irrepresentable condition holds and min|β*_j| >> λ
- Debiased LASSO: asymptotically normal estimator for individual components when p >> n