Statistics

Estimation: The $1.5B Mistake of the Hubble Telescope

The Hubble Space Telescope launched in 1990 with a primary mirror ground about 2.2 micrometers too flat at its edge - a biased estimator with a systematic error. The fix cost $700M and a Space Shuttle mission. Every ML model trained without checking for bias repeats the same mistake at smaller scale.

  • Ridge and Lasso regularization trade bias for variance to reduce test error
  • Hubble telescope $700M mirror fix - the cost of an unchecked biased estimator
  • A/B testing: consistent estimators ensure results hold at production scale
  • BatchNorm uses biased variance estimator (divides by n, not n-1) by design
  • Thompson Sampling: Bayesian estimator for multi-armed bandit exploration
  • Bessel's correction (n-1): unbiased sample variance, the default in R's var() and pandas, while NumPy's np.var divides by n unless ddof=1 is passed

Prerequisites

  • Sampling: how 1,000 people predict the behavior of a billion

Three Questions for Any Estimator

**April 24, 1990. NASA launches the Hubble Space Telescope.** A 2.4-meter primary mirror - the most precisely polished mirror ever created. Engineers at Perkin-Elmer checked its shape thousands of times with a special instrument called a null corrector. Every time: perfect. The first images from orbit came back blurry. The mirror had been polished with stunning precision - but **systematically wrong**. The null corrector had a design flaw: one lens was displaced by 1.3 mm. Every measurement produced the same error of 2.2 micrometers from the required shape. For three years, astronomy looked through warped glass. The 1993 repair mission cost another $700 million. **This is a story about a biased estimator.**

**What this lesson actually teaches**: not "how to compute a sample mean", but why any estimator has **three independent properties** - and what happens when one of them breaks. Bias killed Hubble. Variance makes predictions unstable. Lack of consistency destroys trust at scale. Every time a loss function or normalization method is chosen, an estimator with specific properties is being selected. In 30 minutes it will be clear why n-1 appears in the variance formula and why L2 regularization deliberately introduces bias.

**A point estimator** is any function of sample data used to approximate an unknown parameter. The sample mean X-bar is an estimator for mu. The sample variance is an estimator for sigma-squared. But a single parameter can have infinitely many estimators. The question: which one is right? Three criteria answer this.

| Property | Question | What breaks when it fails | ML analogue |
|---|---|---|---|
| Unbiasedness | Does the estimator hit the target on average? | Hubble: systematically wrong. Model with constant underfitting | Systematic model error, bias in feature engineering |
| Efficiency | How much does it scatter from sample to sample? | Unstable predictions across different splits | Overfitting, high weight variance without regularization |
| Consistency | Does the estimator improve as n grows? | The algorithm does not learn from data | Model that does not scale with training set size |

**Bias-variance tradeoff** is not a metaphor - it is a literal formula: MSE(theta-hat) = Bias-squared(theta-hat) + Var(theta-hat). Every regularization method (L1, L2, dropout) trades one for the other. Ridge introduces bias but reduces variance. Understanding this makes hyperparameter tuning logical, not trial-and-error.
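To make the trade concrete, here is a minimal sketch under a simplifying assumption: a one-dimensional stand-in for ridge, the estimator c * X-bar with a shrinkage factor c between 0 and 1. The helper name `shrinkage_mse`, the factor `c`, and the example values of mu, sigma and n are illustrative and not part of the lesson. The bias of this estimator is (c - 1) * mu and its variance is c^2 * sigma^2 / n, so the MSE formula can be evaluated exactly.

```python
def shrinkage_mse(c, mu, sigma, n):
    """Exact bias^2, variance, and MSE of the estimator c * X-bar for the mean mu."""
    bias_sq = ((c - 1.0) * mu) ** 2      # E[c * X-bar] - mu = (c - 1) * mu
    variance = c ** 2 * sigma ** 2 / n   # Var(c * X-bar) = c^2 * sigma^2 / n
    return bias_sq, variance, bias_sq + variance

mu, sigma, n = 1.0, 2.0, 4  # illustrative values, chosen so that sigma^2 / n = 1

for c in (1.0, 0.75, 0.5, 0.25):
    bias_sq, variance, mse = shrinkage_mse(c, mu, sigma, n)
    print(f"c={c:.2f}  bias^2={bias_sq:.4f}  var={variance:.4f}  mse={mse:.4f}")
```

With these values the unshrunk estimator (c = 1) has MSE 1.0, while c = 0.5 halves it to 0.5 by accepting some bias. The best c depends on the unknown mu, which is one reason regularization strength is usually tuned on held-out data.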

Which three independent properties characterize any point estimator?

Unbiasedness: Hitting the Target on Average

An estimator is unbiased if, over infinitely many repetitions (drawing different samples and computing the estimate each time), it lands exactly on the true parameter on average. More precisely: the expected value of the estimator equals the true value.

Unbiasedness:

$$E[\hat{\theta}] = \theta, \quad \text{meaning} \quad \mathrm{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta = 0$$

Sample mean $\bar{X} = (X_1 + \dots + X_n)/n$:

$$E[\bar{X}] = E\left[\frac{X_1 + \dots + X_n}{n}\right] = \frac{E[X_1] + \dots + E[X_n]}{n} = \frac{\mu + \mu + \dots + \mu}{n} = \frac{n\mu}{n} = \mu$$

The second equality uses linearity of $E$. The sample mean is unbiased.

Sample variance divided by n:

$$S_n^2 = \frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2, \qquad E[S_n^2] = \frac{n-1}{n}\,\sigma^2 \neq \sigma^2 \quad \text{(biased)}$$

Divided by n-1 (Bessel's correction):

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2, \qquad E[S^2] = \sigma^2 \quad \text{(unbiased)}$$

Why n-1? When $\bar{X}$ is computed from the same sample, the deviations $X_i - \bar{X}$ lose one degree of freedom: their sum is always zero. Of n deviations, only n-1 are free - the last is determined by the rest.
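The two expectations for the variance estimators can be checked numerically. The following is a minimal Monte Carlo sketch, assuming normal data with mean 0; the function name `mean_of_variance_estimates` and the values n = 5, sigma = 2 are illustrative choices. It averages the divide-by-n and divide-by-(n-1) estimates over many independent samples.

```python
import random
import statistics

def mean_of_variance_estimates(n, sigma, trials, seed=0):
    """Average both variance estimators over many independent samples of size n."""
    rng = random.Random(seed)
    biased_sum = 0.0
    unbiased_sum = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        biased_sum += statistics.pvariance(sample)   # divides by n
        unbiased_sum += statistics.variance(sample)  # divides by n - 1
    return biased_sum / trials, unbiased_sum / trials

n, sigma = 5, 2.0  # true variance sigma^2 = 4.0
biased_avg, unbiased_avg = mean_of_variance_estimates(n, sigma, trials=10000)
print(f"divide by n:     {biased_avg:.3f}  (theory: {(n - 1) / n * sigma ** 2:.3f})")
print(f"divide by n - 1: {unbiased_avg:.3f}  (theory: {sigma ** 2:.3f})")
```

The divide-by-n average should settle near 3.2 and the divide-by-(n-1) average near 4.0, up to simulation noise.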

Degrees of Freedom: Intuition Without Formulas

Why n-1

Sample: {10, 12, 14}. X-bar = 12. Deviations from the mean: {10-12, 12-12, 14-12} = {-2, 0, +2}. Sum of deviations = -2 + 0 + 2 = 0 - ALWAYS. This means: if the first two deviations (-2 and 0) are known, the third (+2) is determined automatically. Free deviations = n-1 = 2. Dividing by n-1 instead of n compensates for this loss: we 'pay' for using X-bar instead of the true mu.
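The same three-point example can be run directly. This is a small sketch with illustrative variable names; it shows that the last deviation is forced by the others, and what each divisor gives for this sample.

```python
sample = [10, 12, 14]
n = len(sample)
mean = sum(sample) / n

deviations = [x - mean for x in sample]

# Because deviations from the sample mean sum to zero,
# the last one can be recovered from the first n - 1.
implied_last = -sum(deviations[:-1])

sum_of_squares = sum(d * d for d in deviations)
var_divide_by_n = sum_of_squares / n
var_divide_by_n_minus_1 = sum_of_squares / (n - 1)

print("deviations:", deviations)
print("sum of deviations:", sum(deviations))
print("last deviation implied by the others:", implied_last)
print("divide by n:", var_divide_by_n)
print("divide by n - 1:", var_divide_by_n_minus_1)
```

For this sample the sum of squared deviations is 8, so dividing by n gives about 2.67 and dividing by n-1 gives 4.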

Why does the sample variance S² divide by n-1 instead of n?

Efficiency and MSE: When Bias Pays Off

Among unbiased estimators, the one with the **smallest variance** is preferable - it will be closer to the true value for each specific sample. But unbiasedness is not the only criterion. Sometimes a slightly biased but much more stable estimator is the better choice.

Mean Squared Error combines both deficiencies:

$$\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2] = \mathrm{Var}(\hat{\theta}) + \mathrm{Bias}^2(\hat{\theta})$$

Proof: add and subtract $E[\hat{\theta}]$ inside the square.

$$E[(\hat{\theta} - \theta)^2] = E[(\hat{\theta} - E[\hat{\theta}] + E[\hat{\theta}] - \theta)^2] = \mathrm{Var}(\hat{\theta}) + (E[\hat{\theta}] - \theta)^2$$

The cross term $2\,(E[\hat{\theta}] - \theta)\,E[\hat{\theta} - E[\hat{\theta}]]$ vanishes because $E[\hat{\theta} - E[\hat{\theta}]] = 0$.

Comparison of estimators for $\mu$ under $N(\mu, \sigma^2)$:

  • $\hat{\theta}_1 = \bar{X}$: Bias = 0, Var = $\sigma^2/n$, so MSE = $\sigma^2/n$
  • $\hat{\theta}_2 = X_1$: Bias = 0, Var = $\sigma^2$, so MSE = $\sigma^2$ (n times worse)

At n = 100 the variance of $\bar{X}$ is 100x smaller than that of $X_1$. Both are unbiased, but $\bar{X}$ is far more efficient.

Stein's paradox (Stein 1956; James and Stein 1961): when estimating a vector $\mu \in \mathbb{R}^k$ with $k \geq 3$, $\bar{X}$ is not optimal in MSE. A biased 'shrinkage' estimator with smaller total MSE always exists. The same shrinkage idea underlies L2 regularization in ML.
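The decomposition and the comparison of the two estimators can be checked by simulation. This is a sketch under stated assumptions: normal data, illustrative values mu = 5, sigma = 2, n = 25, and hypothetical helper names `simulate` and `first_observation`. It estimates bias, variance and MSE empirically for the sample mean and for the first observation alone.

```python
import random
import statistics

def first_observation(sample):
    """Estimator that ignores everything except the first data point."""
    return sample[0]

def simulate(estimator, mu, sigma, n, trials, seed):
    """Return empirical (bias, variance, mse) of an estimator of mu over repeated samples."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimates.append(estimator(sample))
    bias = statistics.fmean(estimates) - mu
    variance = statistics.pvariance(estimates)
    mse = statistics.fmean([(e - mu) ** 2 for e in estimates])
    return bias, variance, mse

mu, sigma, n = 5.0, 2.0, 25  # illustrative values; sigma^2 / n = 0.16
results = {
    "X-bar": simulate(statistics.fmean, mu, sigma, n, trials=10000, seed=1),
    "X1": simulate(first_observation, mu, sigma, n, trials=10000, seed=1),
}
for name, (bias, variance, mse) in results.items():
    print(f"{name}: bias={bias:+.3f}  var={variance:.3f}  "
          f"mse={mse:.3f}  bias^2+var={bias ** 2 + variance:.3f}")
```

In each row the last two columns agree, because the decomposition also holds exactly for the empirical quantities. The MSE of the sample mean should land near 0.16 and that of the first observation near 4.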

**A paradox**: for normally distributed data, the biased variance estimator (dividing by n) has a **smaller MSE** than the unbiased one, and the gap is widest at small n. The reason is that the larger factor 1/(n-1) scales the estimator up, and its variance with it, by more than the squared bias it removes. In ML the same logic justifies L2 regularization: weights are deliberately shifted toward zero, but they become more stable across runs and different data partitions.
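The comparison of the two variance estimators can be evaluated in closed form under one assumption: the data are normal, so the sum of squared deviations equals sigma^2 times a chi-square variable with n-1 degrees of freedom (mean n-1, variance 2(n-1)). The helper name `variance_estimator_mse` is hypothetical, and n = 5 is an illustrative choice.

```python
def variance_estimator_mse(divisor, n, sigma=1.0):
    """MSE of sum((Xi - X-bar)^2) / divisor as an estimator of sigma^2, for normal data."""
    dof = n - 1
    expected = dof * sigma ** 2 / divisor             # E[estimator]
    variance = 2.0 * dof * sigma ** 4 / divisor ** 2  # Var[estimator]
    bias = expected - sigma ** 2
    return bias ** 2 + variance

n = 5
for divisor in (n - 1, n, n + 1):
    print(f"divide by {divisor}: MSE = {variance_estimator_mse(divisor, n):.4f}")
```

For n = 5 and sigma = 1 this gives 0.5 for the unbiased divisor n-1, 0.36 for n, and about 0.333 for n+1: the unbiased estimator has the largest MSE of the three.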

MSE(θ̂) = Bias²(θ̂) + Var(θ̂). What does this decomposition mean in practice?

Consistency: Does the Estimator Get Smarter with Data?

Unbiasedness is about the mean. Consistency is about behavior as n grows. **A good estimator must become more accurate as data is added.** This is the minimum common-sense requirement - and it does not follow automatically from unbiasedness.

An estimator $\hat{\theta}_n$ is consistent if

$$\text{for all } \varepsilon > 0: \quad P(|\hat{\theta}_n - \theta| > \varepsilon) \to 0 \quad \text{as } n \to \infty$$

Notation: $\hat{\theta}_n \xrightarrow{P} \theta$.

$\bar{X}$ is consistent - a consequence of the Law of Large Numbers: $\bar{X}_n \xrightarrow{P} \mu$.

Examples showing that unbiasedness and consistency can come apart:

  • $\hat{\theta} = X_1$ (first observation): $E[X_1] = \mu$, so it is unbiased; $\mathrm{Var}(X_1) = \sigma^2$ does not decrease with n, so it is inconsistent
  • $\hat{\theta} = S_n^2$ (divide by n): $E[S_n^2] = \frac{n-1}{n}\sigma^2$, so it is biased; the bias and the variance both go to 0 as $n \to \infty$, so it is consistent

All four combinations are real:

  • unbiased + consistent: $\bar{X}$ for $\mu$ (best case)
  • unbiased + inconsistent: $X_1$ for $\mu$ (useless)
  • biased + consistent: $S_n^2$ for $\sigma^2$ (acceptable)
  • biased + inconsistent: $2X_1$ for $\mu$ when $\mu \neq 0$ (failure)
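The definition can be watched in action by estimating the miss probability P(|estimate - mu| > epsilon) at several sample sizes. This is a minimal sketch assuming standard normal data and epsilon = 0.2; the helper names `miss_probability` and `first_observation` are illustrative.

```python
import random
import statistics

def first_observation(sample):
    """Unbiased but inconsistent: uses one data point no matter how large n is."""
    return sample[0]

def miss_probability(estimator, n, mu=0.0, sigma=1.0, eps=0.2, trials=2000, seed=2):
    """Monte Carlo estimate of P(|estimator - mu| > eps) for samples of size n."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        if abs(estimator(sample) - mu) > eps:
            misses += 1
    return misses / trials

sizes = (10, 100, 400)
mean_miss = {n: miss_probability(statistics.fmean, n) for n in sizes}
first_miss = {n: miss_probability(first_observation, n) for n in sizes}

for n in sizes:
    print(f"n={n:4d}   X-bar: {mean_miss[n]:.3f}   X1: {first_miss[n]:.3f}")
```

The miss probability of the sample mean should fall toward zero as n grows, while that of the first observation should stay near the same value at every n.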

**Consistency is the minimum requirement.** An inconsistent estimator does not learn from data: at n=10 it is just as accurate (or inaccurate) as at n=10,000. The ML analogue: a model that does not improve as the training dataset grows is an inconsistent algorithm. This is a diagnosis, not a tuning problem.

Which estimator is unbiased but NOT consistent?

Where These Properties Live in Real Systems

Bias, variance, consistency - these are not textbook abstractions. They are engineering characteristics of every component in an ML system. When Netflix estimates user preferences or Stripe estimates fraud probability, estimators with specific deliberately chosen properties are at work.

**Final frame**: Hubble was fixed in 1993 - the COSTAR corrective optics compensated for the systematic error. In statistics the analogue is debiasing: subtracting an estimate of the bias from the estimator. In ML - BatchNorm correction at inference, neural network calibration. Once it is understood what broke (bias or variance), the right tool for fixing it becomes clear.
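Debiasing by subtraction can be sketched with the bootstrap, one common way to estimate a bias when no formula is at hand. The sketch below makes several assumptions: the function name `bootstrap_bias_correct` is hypothetical, the ten measurements are made up for illustration, and the estimator being corrected is the divide-by-n variance. The bias is estimated as the average of the estimator over resamples minus its value on the original sample, then subtracted.

```python
import random
import statistics

def bootstrap_bias_correct(sample, estimator, n_boot=4000, seed=3):
    """Return (plug-in estimate, estimated bias, bias-corrected estimate)."""
    rng = random.Random(seed)
    n = len(sample)
    plug_in = estimator(sample)
    # Resample with replacement and recompute the estimator each time.
    boot = [estimator(rng.choices(sample, k=n)) for _ in range(n_boot)]
    bias_hat = statistics.fmean(boot) - plug_in   # bootstrap estimate of the bias
    return plug_in, bias_hat, plug_in - bias_hat  # debias by subtraction

# Hypothetical measurements, made up for illustration.
sample = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7, 5.9, 4.4, 5.5, 5.2]

plug_in, bias_hat, corrected = bootstrap_bias_correct(sample, statistics.pvariance)
print(f"divide-by-n variance: {plug_in:.4f}")
print(f"estimated bias:       {bias_hat:+.4f}")
print(f"corrected estimate:   {corrected:.4f}")
print(f"divide-by-(n-1):      {statistics.variance(sample):.4f}")
```

For the divide-by-n variance the estimated bias comes out close to minus the estimate divided by n, so the corrected value moves most of the way toward the n-1 estimate without using Bessel's formula.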

Which ML component behaves like an inconsistent estimator?

Practice: Diagnosing Estimators

Diagnostics: across 10 000 simulations the estimator's average equals the true parameter, but its variance does not shrink as n grows. Diagnosis?

Key Takeaways

  • **The Hubble lesson**: bias destroys results even with perfect precision. An estimator must be checked for bias before a system goes live
  • **Three independent properties**: unbiasedness (E[theta-hat]=theta), efficiency (minimum variance among unbiased), consistency (theta-hat->theta as n->infinity)
  • **n-1 in variance**: Bessel's correction compensates for the loss of one degree of freedom when computing X-bar from the same sample
  • **MSE = bias^2 + variance**: the unified quality measure. A biased estimator with smaller MSE beats an unbiased one with larger MSE - the foundation of Ridge and BatchNorm
  • **Unbiasedness != consistency**: X1 is unbiased but inconsistent. S2_n is biased but consistent
  • **In ML every day**: L2 regularization = deliberate bias; BatchNorm = biased variance on the batch; Thompson Sampling = consistent Bayesian estimator

What's Next

Now it is clear that estimates can be bad. Next: how to find the best one.

  • Maximum Likelihood Estimation — A universal method for building consistent and asymptotically efficient estimators
  • Confidence Intervals — Not a point but an interval with a coverage guarantee - the right way to report an estimate
  • Cramer-Rao Bound — The lower bound on the variance of any unbiased estimator - the theoretical foundation of efficiency
  • Bootstrap — Estimate bias and variance of any estimator without knowing the distribution - a modern practical tool

Related lessons

  • ml-08-regularization