Statistics
Hierarchical Models (Full Bayes)
How can data from multiple groups (schools, clinics, countries) be combined to improve estimates for each group without collapsing everything into a single pool?
- **Eight Schools experiment (Rubin 1981):** hierarchical model estimates SAT coaching effects across 8 schools with partial pooling
- **Pharmaceutical meta-analysis:** hierarchical model for drug efficacy across 20 clinical centers; Stan runs in major pharma companies worldwide
- **Education PISA:** Item Response Theory - hierarchical model for student abilities and item difficulties across 70+ countries
- **Sports analytics:** estimating home field advantage in football leagues via hierarchical model with partial pooling across teams
Предварительные знания
- Bayesian inference
- Normal model
- Empirical Bayes
Hierarchical models formalize information sharing between groups. The hyperparameter tau controls group similarity: as tau approaches 0, all groups become identical (complete pooling); as tau approaches infinity, groups are independent (no pooling). Bayesian inference automatically finds the optimal tau from data.
The funnel problem in hierarchical models: when data is sparse, tau is near zero and theta_j concentrate around mu. In the (tau, theta_j) space the posterior has a funnel shape - narrow near tau → 0 and wide near tau → infinity. Standard HMC struggles with the funnel: a step size optimal for the wide region is too large for the narrow region. Non-centered parameterization (theta_j = mu + tau·z_j, z_j ~ N(0,1)) transforms the geometry to be more uniform.
Random effects meta-analysis is a direct application of hierarchical models: theta_k ~ N(mu, tau²) for k = 1,...,K studies. The parameter mu is the overall effect, tau² is between-study heterogeneity. REML estimation of tau² corresponds to the EB approach. I² = tau²/(tau² + sigma²_typical) measures the proportion of total variance due to heterogeneity; I² > 0.75 indicates high heterogeneity requiring careful interpretation.
Bayesian optimization applies hierarchical models: a GP surrogate model f(x) is updated after each observation y = f(x) + epsilon. The acquisition function selects the next point x* to evaluate. Expected Improvement EI(x) = E[max(f(x) - f*, 0)] balances exploration and exploitation. BoTorch and Ax implement hierarchical GP models for ML hyperparameter tuning at industrial scale.
Stan best practices: weakly informative priors for tau - Half-Normal(0, 1) or Exponential(1), not Uniform(0, infinity). Gelman recommends non-centered parameterization (theta_j = mu + tau·z_j, z_j ~ N(0,1)) for hierarchical models with small n_j, to avoid funnel geometry in the posterior.
Multilevel regression and poststratification (MrP) uses hierarchical models for small-area estimation: national survey data is modeled with a hierarchical logistic regression (individual-level predictors + group-level predictors + random effects for state/district). Predictions are poststratified to the census population cells. Used by the Economist and FiveThirtyEight to estimate U.S. state-level opinion from national polls with n ~ 1000.
Stochastic volatility model is a canonical hierarchical time series model: log(sigma_t²) = mu + phi·log(sigma_{t-1}²) + eta_t, Y_t ~ N(0, sigma_t²). The latent volatility process sigma_t² is hidden and follows AR(1) in log-space. Full Bayesian inference via NUTS in Stan provides joint posterior over (mu, phi, sigma_eta, sigma_1,...,sigma_T), capturing uncertainty about both parameters and latent states.
hierarchical Bayesian structure
A hierarchical model organises parameters in levels: observations y_{ij} depend on group parameters θ_j, and θ_j on hyperparameters φ. Full Bayes assigns priors to all unknowns, including φ. This lets us account for individual and group information at the same time, without choosing between complete pooling (all groups identical) and no pooling (groups independent).
'Eight schools' (Rubin 1981) is the canonical example: 8 schools estimate SAT-coaching effects, the hierarchy bridges 'no effect' and 'different effects' through an N(μ, τ²) distribution on school-level θ_j.
What is the advantage of a hierarchical model over modelling each group independently?
The hierarchy is a compromise between two extremes: complete pooling (θ_j = μ for all j) understates heterogeneity, no pooling (θ_j independent) gives noisy estimates for small groups. The shared level φ lets 'weak' groups pull towards the mean while 'strong' groups deviate from it. The effect is especially pronounced with unequal group sizes.
partial pooling
Partial pooling is a consequence of the hierarchical structure. The posterior mean of group parameter θ_j is a weighted combination of the sample mean ȳ_j (no pooling) and the grand mean μ (complete pooling). Weights are set by data and group precisions: small n_j or noisy data, more weight on μ; lots of data or large between-group variance, more weight on ȳ_j.
Funnel plot: visualise |θ̂_j - μ̂| against n_j. Without pooling: a diverging funnel; with partial pooling, a compressed one, especially for small n_j.
What happens to the estimate of θ_j as n_j grows in a normal hierarchical model?
B_j = (σ²/n_j) / (σ²/n_j + τ²): as n_j → ∞ the numerator → 0, B_j → 0, and E[θ_j | y] = (1-B_j)ȳ_j + B_j·μ → ȳ_j. Large groups 'trust' their data; small groups borrow from the grand mean. This automatic behaviour is a strong advantage of hierarchical models.
MCMC and HMC-NUTS
Sampling from the posterior p(θ, φ | y) is the main computational tool in hierarchical Bayes. Metropolis-Hastings performs poorly in high dimensions due to random-walk inefficiency. Hamiltonian Monte Carlo (HMC) uses the log-density gradient for long informed jumps; NUTS (No-U-Turn Sampler) automatically tunes the trajectory length.
Why is HMC fundamentally more efficient than Metropolis-Hastings in high dimensions?
In p dimensions, random-walk M-H needs a step of order 1/√p for acceptable acceptance, so mixing is slow (O(p) steps per independent sample). HMC integrates Hamiltonian dynamics, keeping high acceptance even with large steps, and autocorrelation decays as p^{1/4} instead of p. NUTS adds automatic trajectory-length tuning.
Hierarchical models and related fields
Hierarchical Bayesian models are the standard for grouped-data analysis in medicine, psychology, and economics.
- Linear mixed models (LME) — Frequentist LME is a hierarchical model with a point estimate of τ instead of a full posterior
- Meta-analysis — Study-level hierarchy yields random effects and an estimate of between-study variance τ²
- Stan / PyMC — Primary tools for NUTS sampling from the posterior of hierarchical models
Итоги
- Hierarchical model: y_{ij}|theta_i, theta_i|tau ~ N(mu, tau²); tau controls group similarity
- Partial pooling: theta-hat_j is a weighted combination of y-bar_j and mu-hat; small n_j shrinks more toward global mean
- HMC uses the gradient of log p(theta|y) to propose distant points with high acceptance rate
- NUTS automatically selects trajectory length, eliminating manual HMC tuning
- R-hat < 1.1 by Gelman-Rubin is the standard convergence criterion for multiple MCMC chains
- Non-centered parameterization: theta_j = mu + tau·z_j eliminates posterior funnel geometry