Statistics

Empirical Bayes

How can estimating all parameters jointly, using information about their collective distribution, improve accuracy over estimating each one independently?

**Genomics (limma):** Smyth (2004) applies EB to the gene-wise variance estimates, producing moderated t-statistics for 20,000 genes at once; became the standard for microarray analysis with over 20,000 citations
**Sports statistics:** Efron and Morris (1975) showed that batting averages for 18 MLB players are estimated more accurately with joint shrinkage than player by player
**Recommendation systems:** Elo-like rating systems implicitly use EB: each team's strength estimate is a posterior estimate pulled toward the common distribution of strengths
**Medical imaging:** EB spatial smoothing of PET/fMRI images, with the spatial hyperparameters estimated from the data

Prerequisites

Normal distribution
Bayesian inference
Risk and loss functions

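Charles Stein proved in 1956 that for X ~ N_p(mu, σ²I) with p >= 3, the MLE mu-hat = X is inadmissible under squared-error risk: another estimator exists whose risk is never higher and is strictly lower for at least one value of mu. The James-Stein estimator does better still, with lower risk at every value of the parameter. This overturned the prevailing intuition about MLE optimality.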

EB for local FDR (Efron, 2008): in multiple testing, we observe z-statistics z₁,...,z_m. Their marginal density is the mixture f(z) = π₀·f₀(z) + π₁·f₁(z) with π₁ = 1 - π₀, where f₀ = N(0,1) under H₀ and f₁ is an unknown density under H₁. EB estimates f(z) from the data (e.g., by Poisson regression on the z-histogram counts). The local false discovery rate fdr(z) = π₀·f₀(z)/f(z) is the posterior probability that a case with statistic z comes from H₀.

Stein's Unbiased Risk Estimate (SURE): for any weakly differentiable estimator g(X) with X ~ N(mu, σ²I), the risk satisfies E[‖g(X) - mu‖²] = -p·σ² + E[‖g(X) - X‖²] + 2σ²·E[∇·g(X)]. The quantity -p·σ² + ‖g(X) - X‖² + 2σ²·∇·g(X) is therefore an unbiased estimate of the risk that can be computed without knowing the true mu. As a check, g(X) = X gives -p·σ² + 0 + 2σ²·p = p·σ², the risk of the MLE. Minimizing SURE over λ selects the shrinkage threshold in wavelet denoising (Donoho & Johnstone, 1995), giving near-minimax adaptive estimators.
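A minimal sketch of SURE-based threshold selection for soft thresholding, assuming a known noise level σ and a simulated sparse mean vector. The function names (`soft_threshold`, `sure_soft`, `sure_threshold`) and the simulated data are illustrative assumptions, not taken from any particular package. For soft thresholding, ‖g(x) - x‖² = Σ min(x_i², λ²) and the divergence is the number of coordinates with |x_i| > λ.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each coordinate toward zero by lam; values in [-lam, lam] become 0."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def sure_soft(x, lam, sigma=1.0):
    """SURE for soft thresholding: -p*sigma^2 + ||g(x) - x||^2 + 2*sigma^2 * div g(x)."""
    p = x.size
    residual = np.sum(np.minimum(x**2, lam**2))      # ||g(x) - x||^2
    divergence = np.count_nonzero(np.abs(x) > lam)   # coordinates left non-zero
    return -p * sigma**2 + residual + 2 * sigma**2 * divergence

def sure_threshold(x, sigma=1.0):
    """Pick lam minimising SURE; the minimum is attained at 0 or at one of the |x_i|."""
    candidates = np.concatenate(([0.0], np.sort(np.abs(x))))
    risks = [sure_soft(x, lam, sigma) for lam in candidates]
    return candidates[int(np.argmin(risks))]

# Simulated example (assumed data): 50 non-zero means out of 1000, unit noise.
rng = np.random.default_rng(0)
mu = np.zeros(1000)
mu[:50] = 5.0
x = mu + rng.standard_normal(1000)

lam = sure_threshold(x)
loss_raw = np.sum((x - mu) ** 2)
loss_sure = np.sum((soft_threshold(x, lam) - mu) ** 2)
print(f"lambda = {lam:.2f}, raw loss = {loss_raw:.0f}, thresholded loss = {loss_sure:.0f}")
```

The true mean `mu` is used only to report the realised loss; the threshold itself is chosen from `x` alone.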

EB in credit scoring: let μ_i be the true creditworthiness score of borrower i; the observed score x_i has noise variance σ_i², which depends on the length of the transaction history. With B = Var(μ_i) the between-borrower variance and μ̂ the population mean, EB shrinks x_i toward μ̂: μ̂_i^EB = (1 - σ_i²/(σ_i² + B))·x_i + σ_i²/(σ_i² + B)·μ̂. New borrowers (large σ_i²) get more shrinkage toward the mean; experienced borrowers (small σ_i²) stay close to their historical estimate.
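The shrinkage rule above can be sketched in a few lines. The scores, noise variances, population mean and B below are made-up numbers for illustration only; in practice μ̂ and B would be estimated from the whole portfolio.

```python
import numpy as np

def eb_shrink(x, sigma2, mu_hat, B):
    """Shrink each x_i toward mu_hat with weight sigma_i^2 / (sigma_i^2 + B) on the mean."""
    w = sigma2 / (sigma2 + B)          # weight on the population mean
    return (1.0 - w) * x + w * mu_hat

# Hypothetical borrowers: the first two are new (noisy), the last two have long histories.
scores = np.array([620.0, 780.0, 700.0, 540.0])
noise_var = np.array([2500.0, 2500.0, 100.0, 100.0])
mu_hat = 680.0    # assumed population mean
B = 900.0         # assumed between-borrower variance

shrunk = eb_shrink(scores, noise_var, mu_hat, B)
print(np.round(shrunk, 1))
```

New borrowers move most of the way to the population mean, while borrowers with long histories move only about a tenth of the distance.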

Connection to Bayesian inference: EB is an approximate Bayesian approach in which hyperparameters are estimated from data instead of being given a full prior. Full Bayes places a hyperprior on B; EB substitutes the point estimate B-hat, which saves computation and gives similar results in practice.

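Wavelet denoising and EB: Donoho & Johnstone (1994) showed that soft thresholding of wavelet coefficients at the universal threshold lambda = sigma·sqrt(2·log n) is a nearly optimal adaptive estimator: its risk is within a factor of order log n of the oracle risk, the risk of an ideal estimator that knows which coefficients to keep. Soft thresholding is also the posterior mode under a double-exponential (Laplace) prior on the coefficients, and choosing the threshold from the data, for example by minimizing SURE, is what makes it an EB procedure. ImageJ implements this algorithm; the WaveThresh package in R provides a full suite.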
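A small sketch of the universal threshold applied to a vector of noisy coefficients. To stay self-contained it skips the wavelet transform itself and works on a simulated sparse coefficient vector with known σ; both are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def universal_threshold(n, sigma=1.0):
    """Universal threshold sigma * sqrt(2 * log n)."""
    return sigma * np.sqrt(2.0 * np.log(n))

def soft_threshold(w, lam):
    """Soft thresholding: shrink toward zero by lam, zeroing small coefficients."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Simulated coefficients (assumed data): 20 large coefficients, the rest pure noise.
rng = np.random.default_rng(1)
n = 1024
coeffs = np.zeros(n)
coeffs[:20] = 8.0
noisy = coeffs + rng.standard_normal(n)

lam = universal_threshold(n)
denoised = soft_threshold(noisy, lam)

print(f"threshold = {lam:.3f}")
print("noise coefficients surviving:", np.count_nonzero(denoised[20:]))
print("signal coefficients surviving:", np.count_nonzero(denoised[:20]))
```

The threshold grows slowly with n so that, with high probability, pure-noise coefficients are set to zero; the price is a bias of about lam on each surviving coefficient.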

Compound decision theory (Robbins, 1951) is the formal framework for EB: estimate theta_1,...,theta_n simultaneously to minimize the total risk E[∑_i L(theta_hat_i, theta_i)]. Treating the n problems together can beat solving them one at a time, because the observations jointly carry information about the distribution of the theta_i that no single observation has. Robbins showed that the optimal compound decision rule depends on the unknown marginal f(x) = ∫ p(x|theta) pi(theta) d theta, which EB estimates from all n observations.
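For the normal model X | theta ~ N(theta, σ²), the dependence on the marginal can be written out explicitly. The identity below is usually called Tweedie's formula; the name and the Gaussian-prior check are added here as an illustration and are not part of the text above.

```latex
\begin{aligned}
f(x) &= \int \frac{1}{\sigma}\,\varphi\!\left(\frac{x-\theta}{\sigma}\right)\pi(\theta)\,d\theta, \\[4pt]
\mathbb{E}[\theta \mid x] &= x + \sigma^{2}\,\frac{d}{dx}\log f(x), \\[4pt]
\theta \sim N(\mu,\tau^{2})
\;\Rightarrow\;
f &= N(\mu,\ \sigma^{2}+\tau^{2})
\;\Rightarrow\;
\mathbb{E}[\theta \mid x] = x - \frac{\sigma^{2}}{\sigma^{2}+\tau^{2}}\,(x-\mu).
\end{aligned}
```

The posterior mean needs only f and its derivative, never the prior pi itself, so an estimate of f from all n observations is enough to approximate the Bayes rule.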

EB smoothing in spatial statistics: for disease mapping, the observed death count in region i is Y_i ~ Poisson(theta_i·n_i), where n_i is the population at risk and theta_i the underlying rate. The EB estimate shrinks the crude rate Y_i/n_i toward the global rate μ̂: theta_hat_i^EB = B_i·(Y_i/n_i) + (1-B_i)·μ̂, where B_i = Var(theta)/(Var(theta) + μ̂/n_i) and μ̂/n_i approximates the Poisson sampling variance of the crude rate. Regions with small populations (small n_i) shrink heavily; densely populated regions (large n_i) keep their observed rate. Implemented in the R SpatialEpi package.
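A method-of-moments sketch of this kind of rate smoothing. It is an illustrative implementation, not the code of any R package: the sampling variance of the crude rate in region i is taken as μ̂/n_i (the Poisson variance of Y_i/n_i at the global rate), Var(theta) is estimated by a population-weighted variance of the crude rates minus the average sampling variance, and the counts are invented.

```python
import numpy as np

def eb_poisson_rates(y, n):
    """Shrink crude rates y/n toward the global rate; returns (EB rates, weights B_i)."""
    y = np.asarray(y, dtype=float)
    n = np.asarray(n, dtype=float)
    crude = y / n
    mu_hat = y.sum() / n.sum()                        # global rate
    s2 = np.sum(n * (crude - mu_hat) ** 2) / n.sum()  # weighted variance of crude rates
    var_theta = max(s2 - mu_hat / n.mean(), 0.0)      # moment estimate of Var(theta), floored at 0
    B = var_theta / (var_theta + mu_hat / n)          # weight on the crude rate
    return B * crude + (1.0 - B) * mu_hat, B

# Hypothetical regions: two small (n = 100) and two large (n = 10000).
y = np.array([0, 6, 250, 350])
n = np.array([100, 100, 10000, 10000])
theta_eb, B = eb_poisson_rates(y, n)

for yi, ni, ti, bi in zip(y, n, theta_eb, B):
    print(f"crude = {yi / ni:.4f}  EB = {ti:.4f}  weight on crude = {bi:.2f}")
```

The small regions, whose crude rates of 0 and 0.06 are mostly noise, are pulled close to the global rate, while the large regions barely move.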

James-Stein estimator

Stein's paradox (1956): when estimating the mean vector θ ∈ ℝ^p of a multivariate normal N_p(θ, σ²I) with p ≥ 3, the observed sample mean X̄ is inadmissible: there exists an estimator with uniformly smaller mean-squared error. The result was shocking because an estimator that is admissible for each one-dimensional problem stops being admissible when several independent problems are combined.

What does Stein's paradox say about the sample mean X̄ as an estimator of the mean vector θ ∈ ℝ^p?

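Write X for the observed vector, with per-coordinate variance σ². Its risk is MSE(X, θ) = p·σ², which grows linearly with p. James and Stein (1961) showed that the estimator θ̂_JS = (1 - (p-2)σ²/‖X‖²)X has strictly smaller MSE for every θ when p ≥ 3. Shrinkage towards zero effectively 'borrows strength' across independent problems, and this is the link to empirical Bayes: under a prior θ_i ~ N(0, τ²), the factor (p-2)σ²/‖X‖² is an unbiased estimate of the Bayes shrinkage weight σ²/(σ² + τ²).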
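A short simulation, offered as a sketch, that compares the risk of the raw observation with that of the James-Stein estimator. The dimension p = 10, the true mean vector and the number of replications are arbitrary choices made for illustration.

```python
import numpy as np

def james_stein(x, sigma2=1.0):
    """James-Stein estimator (1 - (p-2)*sigma^2 / ||x||^2) * x, applied along the last axis."""
    p = x.shape[-1]
    shrink = 1.0 - (p - 2) * sigma2 / np.sum(x**2, axis=-1, keepdims=True)
    return shrink * x

# Monte Carlo comparison at one fixed theta (assumed values).
rng = np.random.default_rng(0)
p, reps = 10, 20000
theta = np.full(p, 1.0)
X = theta + rng.standard_normal((reps, p))   # each row is one draw of X ~ N_p(theta, I)

mse_mle = np.mean(np.sum((X - theta) ** 2, axis=1))
mse_js = np.mean(np.sum((james_stein(X) - theta) ** 2, axis=1))
print(f"MSE of X: {mse_mle:.2f}   MSE of James-Stein: {mse_js:.2f}")
```

The gain is largest when θ is close to the shrinkage target (here zero) and fades as ‖θ‖ grows, but for p ≥ 3 it never turns into a loss.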

Empirical Bayesian shrinkage

Empirical Bayes (Robbins, 1955) uses data to estimate the parameters of the prior, then applies a Bayesian procedure. In the two-level model X_i | θ_i ~ N(θ_i, σ²), θ_i ~ N(μ, τ²), hyperparameters (μ, τ²) are estimated from the marginal X_i ~ N(μ, σ² + τ²) by method of moments or MLE.
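The two-level model can be fitted by method of moments in a few lines. This is a sketch under the stated model with σ² known; the function names and the simulated values of μ and τ² are assumptions made for the example.

```python
import numpy as np

def fit_normal_eb(x, sigma2):
    """Method of moments on the marginal X_i ~ N(mu, sigma^2 + tau^2)."""
    mu_hat = x.mean()
    tau2_hat = max(x.var(ddof=1) - sigma2, 0.0)   # floor at 0: a variance cannot be negative
    return mu_hat, tau2_hat

def eb_posterior_mean(x, sigma2, mu_hat, tau2_hat):
    """Plug-in posterior mean: mu + tau^2 / (tau^2 + sigma^2) * (x - mu)."""
    w = tau2_hat / (tau2_hat + sigma2)
    return mu_hat + w * (x - mu_hat)

# Simulated data (assumed values): mu = 2, tau^2 = 4, sigma^2 = 1.
rng = np.random.default_rng(3)
sigma2 = 1.0
theta = rng.normal(2.0, 2.0, size=5000)
x = theta + rng.standard_normal(5000)

mu_hat, tau2_hat = fit_normal_eb(x, sigma2)
theta_eb = eb_posterior_mean(x, sigma2, mu_hat, tau2_hat)

mse_raw = np.mean((x - theta) ** 2)
mse_eb = np.mean((theta_eb - theta) ** 2)
print(f"mu_hat = {mu_hat:.2f}, tau2_hat = {tau2_hat:.2f}")
print(f"MSE raw = {mse_raw:.3f}, MSE EB = {mse_eb:.3f}")
```

With many units the plug-in estimates are close to the true hyperparameters, so the EB risk approaches the Bayes risk σ²τ²/(σ² + τ²).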

Link to hierarchical modelling: EB approximates the fully Bayesian model where hyperparameters also have priors. EB ignores uncertainty in (μ, τ²) but is simpler and often loses very little precision at large p.

How does empirical Bayes differ from a fully Bayesian approach?

Full Bayes assigns priors to every unknown (including hyperparameters) and reasons via the joint posterior. EB stops one step short: it estimates (μ, τ²) by method of moments or by maximizing the marginal likelihood ∏_i P(X_i | μ, τ²), then plugs in the point estimates. The result is cheap and consistent for large p, but it understates the uncertainty of the low-level parameters θ_i because the uncertainty in (μ, τ²) is ignored.

EB for multiple testing

A hierarchical EB model naturally solves the multiple-testing problem. In Efron's two-component model, X_i is a mixture: fraction π_0 from the null density f_0 (noise) and 1-π_0 from the alternative f_1 (true effects). Local FDR fdr(x) = π_0 f_0(x) / f(x) is the posterior probability of the null at the observed X = x.
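A rough sketch of a local fdr computation under the two-component model. It departs from Efron's procedure in several labelled ways: the marginal f is estimated with a plain Gaussian kernel density estimate (normal-reference bandwidth) rather than a regression fit to histogram counts, the null is taken as the theoretical N(0,1), and π_0 is set to the conservative upper bound 1. The function names and simulated data are illustrative.

```python
import numpy as np

def normal_pdf(z):
    """Standard normal density, used both as the null f_0 and as the KDE kernel."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def kde(z_eval, z_data):
    """Gaussian kernel density estimate of the marginal f at the points z_eval."""
    h = 1.06 * z_data.std(ddof=1) * z_data.size ** (-0.2)   # normal-reference bandwidth
    diffs = (z_eval[:, None] - z_data[None, :]) / h
    return normal_pdf(diffs).mean(axis=1) / h

def local_fdr(z_eval, z_data, pi0=1.0):
    """fdr(z) = pi0 * f_0(z) / f_hat(z), capped at 1."""
    return np.minimum(1.0, pi0 * normal_pdf(z_eval) / kde(z_eval, z_data))

# Simulated z-statistics (assumed data): 95% nulls, 5% effects centred at 4.
rng = np.random.default_rng(2)
z = np.concatenate([rng.standard_normal(1900), rng.normal(4.0, 1.0, 100)])

grid = np.array([0.0, 2.0, 5.0])
fdr = local_fdr(grid, z)
for zi, fi in zip(grid, fdr):
    print(f"z = {zi:.1f}  local fdr = {fi:.4f}")
```

Values near zero are almost certainly null, while a z around 5 has a posterior null probability close to zero; kernel smoothing slightly widens the estimated density, so this sketch is cruder in the tails than a dedicated fit.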

In microarray studies, EB methods (limma, locfdr) typically gain power over applying classical BH to ordinary per-gene test statistics, because they exploit additional structure: the joint distribution of all tests at once.
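The variance moderation behind limma's statistic can be sketched as follows. The posterior variance (d_0·s_0² + d_g·s_g²)/(d_0 + d_g) is the form given by Smyth (2004); here the prior degrees of freedom d_0 and prior variance s_0² are simply supplied as inputs, whereas limma estimates them from all genes. The function names and numbers are hypothetical.

```python
import numpy as np

def moderated_variance(s2, d, s0_2, d0):
    """Shrink gene-wise variances s2 (d residual df) toward the prior variance s0_2 (d0 df)."""
    return (d0 * s0_2 + d * s2) / (d0 + d)

def moderated_t(beta_hat, s2, d, v, s0_2, d0):
    """Moderated t: effect divided by sqrt(moderated variance * unscaled variance factor v)."""
    return beta_hat / np.sqrt(moderated_variance(s2, d, s0_2, d0) * v)

# Three hypothetical genes with the same effect but very different sample variances.
beta_hat = np.array([1.0, 1.0, 1.0])
s2 = np.array([0.01, 1.0, 4.0])
d, v = 4, 0.5          # residual df and unscaled variance factor (assumed)
s0_2, d0 = 1.0, 4.0    # prior variance and prior df (assumed, not estimated here)

t_ordinary = beta_hat / np.sqrt(s2 * v)
t_mod = moderated_t(beta_hat, s2, d, v, s0_2, d0)
print("ordinary t: ", np.round(t_ordinary, 2))
print("moderated t:", np.round(t_mod, 2))
```

The gene with an accidentally tiny sample variance no longer produces an extreme statistic, which is what stabilises inference when each gene has only a few replicates.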

How does Efron's local FDR differ from Benjamini-Hochberg FDR?

Local fdr is a pointwise Bayesian measure answering 'how plausible is H_0 for this particular x_i'. BH FDR is a global frequentist measure of the expected false-discovery fraction E[V/R] among rejections. For the same numerical cutoff, local fdr is the stricter criterion in the distribution tails, where it is typically larger than the tail-area FDR that averages over all more extreme values, and it ranks individual discoveries more precisely.

Empirical Bayes and related methods

EB connects classical statistics, Bayesian inference, and regularization through data-driven hyperparameter estimation.

Ridge regression: the ridge estimate is the MAP (and posterior mean) under a Gaussian prior on the coefficients; λ = σ²/τ² is the ratio of noise variance to prior variance, which EB estimates from the data
Hierarchical models: EB is a hierarchical model with point estimates of hyperparameters instead of full integration
FDR control: Efron's local fdr uses EB to estimate the null density and flag significant effects
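The ridge connection in the list above can be checked numerically. The sketch below assumes a linear model y = Xβ + noise with noise variance σ² and an independent N(0, τ²) prior on each coefficient; the data and function names are invented for the example.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: argmin ||y - X b||^2 + lam * ||b||^2."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def gaussian_posterior_mean(X, y, sigma2, tau2):
    """Posterior mean of beta under beta ~ N(0, tau2 * I) and noise variance sigma2."""
    p = X.shape[1]
    precision = X.T @ X / sigma2 + np.eye(p) / tau2
    return np.linalg.solve(precision, X.T @ y / sigma2)

# Simulated regression (assumed values).
rng = np.random.default_rng(4)
sigma2, tau2 = 2.0, 0.5
X = rng.standard_normal((200, 5))
beta = rng.normal(0.0, np.sqrt(tau2), size=5)
y = X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=200)

b_ridge = ridge(X, y, lam=sigma2 / tau2)
b_bayes = gaussian_posterior_mean(X, y, sigma2, tau2)
print(np.allclose(b_ridge, b_bayes))
```

Choosing λ therefore amounts to choosing the ratio σ²/τ², and estimating that ratio from the data is the empirical Bayes step.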

Summary

Stein (1956): MLE is inadmissible at p >= 3 - James-Stein estimator achieves lower E‖mu-hat - mu‖² at all mu
JS estimator shrinks X toward zero; risk = p·σ² - (p-2)²σ⁴·E[1/‖X‖²] < p·σ²
EB estimates the prior variance B = Var(mu_i) from data; the weight B/(B+σ²) placed on the observation x_i is the optimal shrinkage coefficient under a Gaussian prior
limma: EB-moderates gene variances by mixing s_g² with global s_0²; t̃_g is stable for small n_g
EB approximates full Bayesian inference: hyperparameters are plugged in instead of marginalized
