Robust Statistics

A single data entry error, a sensor with a dying battery, one patient with a rare anomaly - and a classical mean or regression outputs nonsense. Robust statistics designs estimators that survive the real world.

Financial risk: fat-tailed return distributions make mean-based VaR dangerous; robust covariance estimation is essential
Industrial sensors: noisy IoT data requires M-estimators for reliable signal aggregation
Computer vision: RANSAC is a robust geometric model fitter that tolerates large fractions of mismatched keypoints in stereo matching

Prerequisites

Causal Inference

Breakdown Point: Measuring Estimator Robustness

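Peter Huber's 1964 paper on robust estimation of a location parameter launched the field. The underlying problem: a single contaminated observation, even one out of 1000, is enough to break the sample mean. The breakdown point measures the largest fraction of arbitrary outliers an estimator can absorb before it can be driven arbitrarily far. It is 1/n (0 in the limit) for the mean, 0.5 for the median, and close to 0.5 for MCD, even in dimension 50, provided n is much larger than the dimension.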
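A minimal sketch of this contrast, using only the Python standard library. The sample, the contamination value 1e9, and the helper `contaminate` are illustrative assumptions, not part of any library.

```python
import statistics

def contaminate(data, fraction, bad_value):
    """Replace the first `fraction` of the observations with `bad_value`."""
    m = round(fraction * len(data))
    return [bad_value] * m + list(data[m:])

# Hypothetical clean sample: 1000 values spread evenly over [9.5, 10.5].
clean = [9.5 + i / 999 for i in range(1000)]

for fraction in (0.0, 0.001, 0.25, 0.49):
    sample = contaminate(clean, fraction, bad_value=1e9)
    print(f"contamination={fraction:.3f}  "
          f"mean={statistics.mean(sample):.3e}  "
          f"median={statistics.median(sample):.3f}")
```

With one bad value in 1000 the mean is already off by about six orders of magnitude, while the median stays inside the range of the clean data as long as fewer than half of the observations are replaced.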

**Three robustness criteria:**
1. Breakdown Point (BP) - a global measure: the fraction of outliers the estimator tolerates.
2. Influence Function (IF) - local sensitivity to a small amount of contamination.
3. Rejection Point - the minimum outlier magnitude beyond which the estimator stops reacting (for redescending M-estimators).

An ideal estimator has a high BP, a bounded IF, and high statistical efficiency (close to OLS on clean Gaussian data).
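The influence function has a finite-sample analogue, the sensitivity curve, which can be computed directly. The sketch below assumes a small symmetric sample and a hypothetical helper `sensitivity_curve`; it adds one observation x and reports n times the change in the estimate.

```python
import statistics

def sensitivity_curve(estimator, sample, x):
    """n * (T(sample + [x]) - T(sample)): a finite-sample stand-in for the IF."""
    n = len(sample) + 1
    return n * (estimator(sample + [x]) - estimator(sample))

# Hypothetical base sample: 41 values symmetric around 0.
base = [i / 10 for i in range(-20, 21)]

for x in (1, 10, 100, 1000):
    sc_mean = sensitivity_curve(statistics.mean, base, x)
    sc_median = sensitivity_curve(statistics.median, base, x)
    print(f"x={x:>5}  SC(mean)={sc_mean:9.2f}  SC(median)={sc_median:5.2f}")
```

The curve for the mean grows without bound as x grows, while the curve for the median flattens once x passes the middle of the sample. That is the unbounded versus bounded influence distinction in criterion 2.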

An estimator T has a breakdown point of 0.25. If exactly 25% of the observations are replaced by arbitrary numbers, the estimator...

M-Estimators: Huber, Bisquare, Hampel

**M-estimators** minimize sum_i rho(x_i - theta), where rho is a loss function. The optimality condition is sum_i psi(x_i - theta) = 0, where psi = rho'. **Huber's loss:** rho(u) = u^2/2 for |u| <= k, k|u| - k^2/2 otherwise - quadratic near zero, linear in the tails. As k -> infinity the estimator tends to the mean (OLS in regression); as k -> 0 it tends to the median (least absolute deviations in regression). **Tukey's bisquare (biweight):** completely rejects outliers beyond a threshold c: psi(u) = u(1-(u/c)^2)^2 for |u| < c, 0 otherwise.
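The two loss families can be written out directly from these formulas. In the sketch below the function names are illustrative, and the default tuning constants 1.345 and 4.685 are the usual 95%-efficiency choices for residuals already divided by a scale estimate.

```python
def huber_rho(u, k=1.345):
    """Huber loss: quadratic for |u| <= k, linear beyond."""
    a = abs(u)
    return 0.5 * u * u if a <= k else k * a - 0.5 * k * k

def huber_psi(u, k=1.345):
    """Derivative of the Huber loss: u clipped to [-k, k]."""
    return max(-k, min(k, u))

def bisquare_psi(u, c=4.685):
    """Tukey bisquare psi: redescends to exactly 0 for |u| >= c."""
    if abs(u) >= c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2

for u in (0.5, 1.345, 3.0, 6.0):
    print(f"u={u:5.3f}  huber_psi={huber_psi(u):.3f}  "
          f"bisquare_psi={bisquare_psi(u):.3f}")
```

Huber's psi saturates at k, so a gross outlier keeps a bounded but nonzero pull on the estimate. The bisquare psi returns to zero, so a large enough residual has no pull at all.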

**Choosing k in Huber:** k = 1.345 (in units of sigma) yields 95% asymptotic efficiency at the Gaussian model. Smaller k means higher robustness but lower efficiency. The scale sigma must be estimated robustly, typically as sigma_hat = MAD / 0.6745, where MAD is the median absolute deviation from the median - otherwise a single outlier corrupts sigma and defeats the purpose. IRLS (Iteratively Reweighted Least Squares) is the standard algorithm for M-estimators: it converges quickly, and each step is an ordinary weighted least squares fit with weights w_i = psi(r_i/sigma) / (r_i/sigma).
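A minimal IRLS sketch for the simplest case, a Huber location estimate with the scale fixed at MAD / 0.6745. The function names, tolerance, iteration cap, and sample data are assumptions made for illustration; a regression version would replace the weighted mean with a weighted least squares fit.

```python
import statistics

def mad_scale(data):
    """Robust scale: median absolute deviation from the median, over 0.6745."""
    med = statistics.median(data)
    return statistics.median([abs(x - med) for x in data]) / 0.6745

def huber_location(data, k=1.345, tol=1e-9, max_iter=100):
    """Huber M-estimate of location via IRLS, with the scale held fixed."""
    theta = statistics.median(data)   # robust starting point
    sigma = mad_scale(data)
    if sigma == 0:                    # more than half the values coincide
        return theta
    for _ in range(max_iter):
        # Huber weights: 1 inside [-k, k], k/|r| outside (psi(r)/r).
        weights = []
        for x in data:
            r = abs(x - theta) / sigma
            weights.append(1.0 if r <= k else k / r)
        new_theta = sum(w * x for w, x in zip(weights, data)) / sum(weights)
        if abs(new_theta - theta) <= tol * sigma:
            return new_theta
        theta = new_theta
    return theta

# Hypothetical sample: eight readings near 10 and one gross error.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 1000.0]
print("mean :", statistics.mean(readings))
print("huber:", huber_location(readings))
```

The gross error receives a weight of roughly k * sigma / 990, so it shifts the estimate only slightly, while the plain mean is dragged far from the bulk of the readings.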

The bisquare psi function equals zero for |u| > c = 4.685. What happens to an observation with residual r_i = 6 sigma (scaled residual r_i / sigma = 6)?

S-Estimators, MM-Estimators, and MCD

An **S-estimator** minimizes a robust scale of the residuals: min_theta s(r_1(theta),...,r_n(theta)). Tuned for the maximal BP of 50%, it has low efficiency (about 30% at the Gaussian model).

An **MM-estimator (Yohai 1987)** uses two stages:
1. An S-estimator provides a starting point theta and a robust scale s.
2. A bisquare M-step refines theta at the fixed scale s.

This combines the high BP of the S-step with the ~95% efficiency of bisquare.

**MCD** (Minimum Covariance Determinant) finds the h >= n/2+1 observations whose covariance matrix has the smallest determinant; the mean and covariance of that subset give a robust multivariate location/scatter estimate.
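The MCD definition can be made concrete with a brute-force search over all subsets of size h. This is only a sketch for tiny two-dimensional samples: practical implementations use approximate algorithms such as FAST-MCD, and the consistency factor usually applied to the raw scatter is omitted here. The data and function names are hypothetical.

```python
from itertools import combinations

def mean_and_cov(points):
    """Sample mean and 2x2 sample covariance of a sequence of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))

def det2(cov):
    """Determinant of a 2x2 matrix given as nested tuples."""
    return cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]

def brute_force_mcd(points, h):
    """Exhaustive MCD: the h-subset whose covariance determinant is smallest."""
    best = min(combinations(points, h),
               key=lambda subset: det2(mean_and_cov(subset)[1]))
    location, scatter = mean_and_cov(best)
    return location, scatter, list(best)

# Hypothetical data: eight points near the line y = x, three far outliers.
clean = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2), (3.0, 2.8),
         (4.0, 4.1), (5.0, 5.2), (6.0, 5.9), (7.0, 7.1)]
outliers = [(50.0, -40.0), (52.0, -43.0), (49.0, -41.0)]
points = clean + outliers
h = len(points) // 2 + 1   # 6 of 11 observations

location, scatter, subset = brute_force_mcd(points, h)
print("classical mean:", mean_and_cov(points)[0])
print("MCD location  :", location)
```

The search visits every one of the C(11, 6) = 462 subsets, which is why exhaustive MCD is feasible only for toy problems. The selected subset contains clean points only, so the location estimate stays inside the clean cloud while the classical mean is pulled toward the outliers.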

**Masking and swamping:** masking occurs when a cluster of outliers distorts a classical estimate or a monotone M-estimate so much that the outliers themselves no longer look unusual; swamping occurs when clean observations are incorrectly flagged as outliers because the initial fit or scale estimate was corrupted. MCD and MM-estimators are specifically designed to handle masking. For regression, MM-regression is available in R (robustbase::lmrob); in Python, statsmodels.robust.robust_linear_model.RLM provides M-estimation with Huber or Tukey bisquare norms.
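Masking is easy to reproduce in one dimension. The sketch below compares classical z-scores (mean and standard deviation) with robust z-scores built from the median and MAD / 0.6745. It uses these simple univariate estimators rather than MCD or an MM-estimator, and the data, function names, and cutoff of 3 are illustrative assumptions.

```python
import statistics

def classical_flags(data, cutoff=3.0):
    """Flag points whose z-score from mean and standard deviation exceeds cutoff."""
    mu = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mu) / sd > cutoff]

def robust_flags(data, cutoff=3.0):
    """Flag points whose z-score from median and MAD / 0.6745 exceeds cutoff."""
    med = statistics.median(data)
    sigma = statistics.median([abs(x - med) for x in data]) / 0.6745
    return [x for x in data if abs(x - med) / sigma > cutoff]

# Hypothetical sample: 20 clean values in [9.0, 10.9] and a cluster of 5 outliers.
sample = [9.0 + 0.1 * i for i in range(20)] + [100.0] * 5

print("classical:", classical_flags(sample))
print("robust   :", robust_flags(sample))
```

The five outliers inflate the standard deviation enough that their own classical z-scores stay below the cutoff, so nothing is flagged. The median and MAD are barely affected by the cluster, so the robust rule flags all five.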

An MM-estimator has two steps: an S-estimator (BP~50%, efficiency~30%), then a bisquare M-step. What is the breakdown point of the final MM-estimate?

Key Ideas

BP(mean) = 0, BP(median) = 0.5 - the maximum breakdown point for location estimators
Influence function: the gross-error sensitivity (GES, the supremum of |IF|) is infinite for OLS and finite for the median - it measures worst-case sensitivity to a single outlier
M-estimators: sum psi(r_i/sigma) = 0; Huber is a compromise; bisquare completely rejects beyond c*sigma
MM-estimators: BP ~ 50% (from S-step) + efficiency ~ 95% (from bisquare M-step)
MCD: minimizes det(Sigma) over h observations, providing a robust multivariate location/scatter estimate