Statistics

Mixed Effects Models

'Does the new teaching method improve achievement?' If we compare schools, students in the same school resemble each other more than students from different schools, which violates the independence assumption of ordinary regression and ANOVA. Mixed models are a standard tool in psychology, education, medicine, and neuroscience whenever data are grouped.

Education: effect of teaching methods accounting for pupils nested in classes and schools
Medicine: multi-centre trials - patients in hospitals, EEG over time
Neuroscience: fMRI - repeated measurements of one brain, multiple subjects
A/B testing: one user sees many pages - observations within a user are dependent
Longitudinal studies: growth, income, health tracked over years

Prerequisites

Fixed and Random Effects

Google A/B tests across 50+ countries: ignoring the country hierarchy causes 20% false positives. Mixed effects models (lme4 in R, statsmodels in Python) correct this by modelling the grouping explicitly. Pfizer ran COVID vaccine trials with mixed models across 150 sites - the random site effects absorbed 30% of variance, making treatment effect estimation more precise.

Data Type	Problem	Solution
Students in schools	Students in the same school are correlated	Random intercept by school
Patients in hospitals	Patients at the same hospital are correlated	Random intercept by hospital
Repeated measurements	Measurements from the same subject are correlated	Random intercept (and slope) by subject
Products in stores	Sales in the same store are correlated	Random intercept by store
Trials in an experiment	Attempts by the same participant are correlated	Random intercept and slope by participant
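
The correlation problem in the table can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (10 clusters of 20 observations, treatment assigned per cluster, no true effect, made-up standard deviations); the function name `naive_false_positive_rate` is hypothetical. It counts how often a naive two-sample z-test that treats every observation as independent rejects at the 5% level.

```python
import math
import random
import statistics


def naive_false_positive_rate(n_sims=500, n_clusters=10, n_per=20,
                              sd_between=0.5, sd_within=1.0, seed=0):
    """Share of null simulations in which a naive z-test rejects at 5%."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        treated, control = [], []
        for c in range(n_clusters):
            # Shift shared by every observation in the cluster (e.g. one school).
            shift = rng.gauss(0.0, sd_between)
            # Treatment is assigned to whole clusters; there is no true effect.
            arm = treated if c % 2 == 0 else control
            arm.extend(shift + rng.gauss(0.0, sd_within) for _ in range(n_per))
        diff = statistics.fmean(treated) - statistics.fmean(control)
        # Naive standard error: pretends all observations are independent.
        se = math.sqrt(statistics.variance(treated) / len(treated)
                       + statistics.variance(control) / len(control))
        if abs(diff / se) > 1.96:
            rejections += 1
    return rejections / n_sims


rate_clustered = naive_false_positive_rate(sd_between=0.5)    # ICC = 0.2
rate_independent = naive_false_positive_rate(sd_between=0.0)  # ICC = 0
print(f"clustered data:   {rate_clustered:.2f}")
print(f"independent data: {rate_independent:.2f}")
```

With these assumed settings the clustered case should reject far more often than the nominal 5%, while the independent case should stay close to it. The exact rates depend on the cluster size and the between-cluster variance chosen here.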

**ICC (Intraclass Correlation Coefficient)** - the fraction of total variance attributable to differences between groups. ICC = σ²_between / (σ²_between + σ²_within). As a rule of thumb, ICC > 0.05 means a mixed model is needed, while ICC ≈ 0 means ordinary regression is acceptable. A high ICC means observations within a group are very similar, which violates the independence assumption.
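
As a rough illustration of the ICC formula, the sketch below estimates the two variance components with the classical one-way ANOVA estimator for balanced groups. This is one simple way to estimate them, not necessarily what a mixed-model package does internally. The function name `icc_oneway` and the simulated school data (40 schools, 25 pupils each, true ICC of 0.2) are assumptions made for the example.

```python
import random
import statistics


def icc_oneway(groups):
    """ICC from a balanced one-way layout (a list of equally sized lists)."""
    k = len(groups)       # number of groups
    n = len(groups[0])    # observations per group
    grand = statistics.fmean([v for g in groups for v in g])
    means = [statistics.fmean(g) for g in groups]
    # Mean squares between and within groups.
    msb = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    msw = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / (k * (n - 1))
    var_between = max((msb - msw) / n, 0.0)  # truncate negative estimates at 0
    var_within = msw
    return var_between / (var_between + var_within)


# Simulated pupils in schools: between-school sd 1.0, within-school sd 2.0,
# so the true ICC is 1 / (1 + 4) = 0.2.
rng = random.Random(42)
schools = []
for _ in range(40):
    school_effect = rng.gauss(0.0, 1.0)
    schools.append([50 + school_effect + rng.gauss(0.0, 2.0) for _ in range(25)])

icc = icc_oneway(schools)
print(f"estimated ICC: {icc:.2f}")
```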

Experiment: 50 participants, each completing 20 trials. We want to know whether task difficulty affects reaction time. Why can't we use ordinary ANOVA?

lmer Models: Random Intercepts and Slopes

**Random intercept:** each group (school, patient) has its own baseline level, but the same predictor effect. **Random slope:** the effect of a predictor differs across groups (some groups respond more strongly to treatment). Notation: `(1 | group)` - random intercept; `(1 + predictor | group)` - intercept and slope.
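
The `(1 | group)` notation comes from lme4. In statsmodels the same two models are written with the `groups` and `re_formula` arguments of `mixedlm`. The sketch below fits both to simulated data; the column names `y`, `x`, `group` and all simulation settings are assumptions made for the example, and the optimiser may emit convergence warnings on other data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate 30 groups of 20 observations with group-specific intercepts and slopes.
rng = np.random.default_rng(0)
n_groups, n_per = 30, 20
group = np.repeat(np.arange(n_groups), n_per)
x = rng.normal(size=n_groups * n_per)
u0 = rng.normal(scale=1.0, size=n_groups)  # random intercepts
u1 = rng.normal(scale=0.5, size=n_groups)  # random slopes
y = 2.0 + u0[group] + (1.0 + u1[group]) * x + rng.normal(scale=1.0, size=x.size)
df = pd.DataFrame({"y": y, "x": x, "group": group})

# y ~ x + (1 | group): random intercept only.
m_int = smf.mixedlm("y ~ x", df, groups=df["group"]).fit(reml=False)

# y ~ x + (1 + x | group): random intercept and random slope for x.
m_slope = smf.mixedlm("y ~ x", df, groups=df["group"], re_formula="~x").fit(reml=False)

print(m_int.fe_params)   # fixed-effect estimates (intercept and x)
print(m_slope.cov_re)    # 2x2 covariance matrix of the random intercept and slope
```

In the second model, `cov_re` holds the intercept variance, the slope variance, and their covariance. The first model has a single intercept variance.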

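**Model comparison via AIC/BIC:** choose between a random-intercept model and a random-intercept-plus-slope model using a likelihood ratio test or AIC/BIC. In statsmodels, `compare_lr_test()` belongs to OLS results rather than to `mixedlm` results, so for mixed models the likelihood ratio statistic is usually computed as twice the difference in the `llf` attribute of the two fits; fit both with `reml=False`, since statsmodels reports AIC and BIC only for maximum-likelihood fits. The more complex model (intercept + slope) is better suited to repeated measures where individual trajectories differ.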
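
A minimal sketch of this comparison, starting from the log-likelihoods of two nested fits (in statsmodels, the `llf` attribute of a fitted result). The helper names `chi2_sf` and `compare_nested` are hypothetical, and the log-likelihood values in the usage example are made-up inputs, not results from real data. Adding a random slope adds two parameters (the slope variance and the intercept-slope covariance), so the statistic is referred to a chi-square distribution with 2 degrees of freedom.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable, for df = 1 or 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only supports df = 1 or 2")


def compare_nested(llf_simple, llf_complex, k_simple, k_complex):
    """Likelihood ratio test and AIC for two nested models fitted by ML."""
    lr = 2.0 * (llf_complex - llf_simple)
    extra = k_complex - k_simple
    return {
        "lr": lr,
        # Naive chi-square p-value; it tends to be conservative because a
        # variance equal to zero lies on the boundary of the parameter space.
        "p_value": chi2_sf(lr, extra),
        "aic_simple": 2 * k_simple - 2 * llf_simple,
        "aic_complex": 2 * k_complex - 2 * llf_complex,
    }


# Made-up log-likelihoods, for illustration only.
# Random intercept: 2 fixed effects + intercept variance + residual variance = 4.
# Random intercept + slope: the same plus slope variance and covariance = 6.
out = compare_nested(llf_simple=-1520.4, llf_complex=-1512.1, k_simple=4, k_complex=6)
print(out)
```

A lower AIC and a small p-value both favour the model with the random slope. When the two criteria disagree, the choice is a judgment call.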

We are modelling patient weight on a diet at 3 time points. The model formula is: `weight ~ time + (1 | patient)`. What does `(1 | patient)` mean?

Nested Data: Pupils in Classes in Schools

**Nested data** - a three-level (or deeper) hierarchy: pupils (level 1) → classes (level 2) → schools (level 3). Each level adds a random effect. **Cross-classified data** - observations belong to several groupings at once, none nested inside another (a student is taught by different teachers in different subjects).
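
In lme4 a three-level structure like this is typically written `(1 | school/class)`. In statsmodels, one documented pattern is to use the top level as `groups` and to add the lower level as a variance component through `vc_formula`. The sketch below follows that pattern on simulated data; the column names (`school`, `classroom`, `hours`, `score`) and all simulation settings are assumptions made for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulate pupils (level 1) in classes (level 2) in schools (level 3).
rng = np.random.default_rng(1)
n_schools, n_classes, n_pupils = 20, 4, 15

rows = []
for s in range(n_schools):
    school_eff = rng.normal(scale=1.0)        # shared by the whole school
    for c in range(n_classes):
        class_eff = rng.normal(scale=0.7)     # shared by one class
        for _ in range(n_pupils):
            hours = rng.uniform(0, 10)
            score = 50 + 1.5 * hours + school_eff + class_eff + rng.normal(scale=1.0)
            rows.append({"school": s, "classroom": f"{s}-{c}",
                         "hours": hours, "score": score})
df = pd.DataFrame(rows)

# Random intercept for school, plus a variance component for classes within school.
model = smf.mixedlm(
    "score ~ hours",
    df,
    groups="school",
    re_formula="1",
    vc_formula={"classroom": "0 + C(classroom)"},
)
result = model.fit()

print(result.fe_params)  # fixed effects: intercept and hours
print(result.cov_re)     # school-level intercept variance
print(result.vcomp)      # class-within-school variance
print(result.scale)      # residual (pupil-level) variance
```

The three variance estimates correspond to the three levels of the hierarchy. Their shares of the total variance indicate how much of the variation sits at school, class, and pupil level.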

**When to use mixed models:** 1. repeated measurements on the same subjects 2. nested data (pupils in schools) 3. longitudinal studies 4. multi-centre clinical trials (patients across different hospitals) 5. A/B testing with multiple metrics measured on the same user.

Multi-centre clinical trial: 500 patients across 20 hospitals. We want to estimate the effect of a new drug on blood pressure. Which model should we use?

Key Ideas

Hierarchical data violates the independence assumption - mixed models are needed
ICC > 0.05: random effects are necessary
Fixed effects: what we want to estimate (treatment, age)
Random effects: grouping variables (hospital, subject)
(1 | group) - random intercept; (1 + x | group) - intercept and slope
statsmodels (`mixedlm`) fits linear mixed models in Python; pingouin covers repeated-measures and mixed ANOVA
For three-level data - nested random effects

Connections to Other Methods

Mixed models generalise ANOVA (repeated measures ANOVA is a special case), and are related to GEE, Bayesian hierarchical models (the most flexible approach), and multilevel modelling (HLM).

ANOVA - Repeated measures ANOVA is a special case of a mixed model
Bayesian Statistics - Hierarchical Bayesian models are mixed models with priors

Questions for Reflection

What is the difference between a random intercept and a random slope? Give an example where both are important to model.
Why is adding a hospital fixed effect (dummy variables) not equivalent to a hospital random effect? When would one prefer each?
ICC = 0.40 for student data. What does this mean for the standard errors of coefficients in ordinary regression - will they be too large or too small?

Related Lessons

la-13-eigenvectors