Statistics
Factor Analysis: Latent Variables
'What is intelligence, really?' In 1904 Charles Spearman applied factor analysis to test scores and proposed the g-factor. Since then FA has been used to describe the structure of personality (Big Five), consumer preferences, and portfolio risk. FA is an X-ray of hidden reality.
- Psychometrics: the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) - the result of FA on thousands of adjectives
- Marketing: hidden buyer motivations behind questionnaire responses
- Genomics: haplogroups as latent factors of SNP markers
- Finance: factor models of returns (Fama-French 5-factor model)
- Neuroscience: independent components of fMRI signals (ICA)
Prerequisites
Latent Variables: What Lies Behind the Data
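Word2Vec, BERT embeddings, and PCA share the core idea of factor analysis: describing many observed quantities through a smaller set of latent dimensions. OpenAI's text-embedding-ada-002, for example, represents each text as a 1536-dimensional vector. FA model: X = LF + ε, where L is the loading matrix, F are the latent factors (F ~ N(0, I)), and ε is unique noise per variable, uncorrelated with the factors. PCA compresses variance; FA models the latent structure that generates the correlations.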
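One consequence of the model X = LF + ε is worth seeing numerically: with standard-normal factors and uncorrelated unique noise, the covariance of the observed variables should be close to LLᵀ + Ψ, where Ψ is the diagonal matrix of unique variances. The sketch below assumes NumPy; the loading values and sample size are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical loadings for 6 observed variables on 2 latent factors
# (illustrative values, not taken from any real dataset).
L = np.array([
    [0.8, 0.0],
    [0.7, 0.1],
    [0.6, 0.2],
    [0.1, 0.8],
    [0.0, 0.7],
    [0.2, 0.6],
])
# Unique variances chosen so that every observed variable has unit variance.
psi = 1.0 - np.sum(L**2, axis=1)

n = 100_000
F = rng.standard_normal((n, 2))                    # latent factors ~ N(0, I)
eps = rng.standard_normal((n, 6)) * np.sqrt(psi)   # unique noise, one variance per variable
X = F @ L.T + eps                                  # row-wise form of X = LF + eps

implied_cov = L @ L.T + np.diag(psi)               # covariance implied by the FA model
sample_cov = np.cov(X, rowvar=False)               # covariance observed in the simulated data
max_abs_diff = np.max(np.abs(sample_cov - implied_cov))
print(f"largest gap between sample and implied covariance: {max_abs_diff:.4f}")
```

The off-diagonal entries of the covariance come entirely from LLᵀ, which is why FA is described as a model of the correlations between variables.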
| Aspect | PCA | Factor Analysis |
|---|---|---|
| Goal | Compress variance | Model latent structure |
| Model | X = PC (deterministic) | X = LF + ε (probabilistic) |
| Uniqueness | None (all explained by PCs) | Present (ε - specific noise) |
| Interpretability | Components = mathematical constructs | Factors = meaningful concepts |
| Rotation | Optional | Key tool for interpretation |
| Application | Dimensionality reduction | Psychometrics, surveys, genomics |
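The 'Uniqueness' row of the comparison can be illustrated in software. The sketch below assumes scikit-learn is installed and uses simulated data with made-up loadings and deliberately unequal noise levels. It fits `FactorAnalysis` and `PCA` with two components each and prints what each model reports.

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(1)

# Hypothetical ground truth: 2 latent factors, 6 variables, unequal noise levels.
L_true = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
psi_true = np.array([0.2, 0.5, 1.0, 0.2, 0.5, 1.0])

n = 5000
factors = rng.standard_normal((n, 2))
noise = rng.standard_normal((n, 6)) * np.sqrt(psi_true)
X = factors @ L_true.T + noise

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
pca = PCA(n_components=2).fit(X)

# FA estimates a separate noise variance (uniqueness) for every variable.
print("FA noise variances: ", np.round(fa.noise_variance_, 2))
print("true noise variances:", psi_true)

# PCA has no per-variable noise term; it reports the share of total
# variance captured by each component.
print("PCA explained variance ratio:", np.round(pca.explained_variance_ratio_, 2))
```

In a run like this the FA noise estimates should track the simulated values, whereas PCA offers no comparable quantity.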
**History:** Factor analysis was developed by psychologist Charles Spearman in 1904 to study intelligence. The g-factor (general intelligence) is the first and most famous application of FA. Today FA is used in psychometrics, marketing (hidden buyer motivations), neuroscience, and genomics.
A researcher wants to understand which latent personality traits underlie responses to 50 questionnaire items. Which method is more appropriate?
Factor Loadings and Uniqueness
A **factor loading** is the correlation between an observed variable and a latent factor (for standardised variables and orthogonal factors). A high loading (|l| > 0.5) means the variable is a strong indicator of that factor. **Communality** h² is the fraction of the variable's variance explained by all factors together: h² = Σl², the sum of the variable's squared loadings across factors. **Uniqueness** ψ = 1 − h² is the part of variance specific to that variable.
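The definitions above reduce to a few lines of arithmetic. A minimal sketch, assuming NumPy, standardised variables, and orthogonal factors; the loading matrix is hypothetical.

```python
import numpy as np

# Hypothetical loadings of 4 items on 2 orthogonal factors (illustrative values).
loadings = np.array([
    [0.75, 0.10],
    [0.68, 0.05],
    [0.15, 0.80],
    [0.30, 0.40],
])

communality = np.sum(loadings**2, axis=1)   # h^2: sum of squared loadings per item
uniqueness = 1.0 - communality              # psi = 1 - h^2 for standardised variables
strong = np.abs(loadings) > 0.5             # the |l| > 0.5 rule of thumb

for i, (h2, u) in enumerate(zip(communality, uniqueness)):
    factors = np.flatnonzero(strong[i]).tolist()
    print(f"item {i}: h2={h2:.2f}, uniqueness={u:.2f}, strong indicator of factors {factors}")
```

The last item has no loading above 0.5 and a low communality, which is the typical profile of a variable the factors describe poorly.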
**FA prerequisites:**
1. KMO > 0.6 (sampling adequacy)
2. Bartlett's test significant (p < 0.05): non-trivial correlations exist
3. n ≥ 5 × p (at least 5 observations per variable, ideally 10×)
4. Metric data (or polytomous ordinal). For binary data, use ordinal FA or IRT.
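The first three checks can be computed directly from the correlation matrix. The sketch below assumes NumPy and SciPy and uses the standard textbook formulas for Bartlett's test of sphericity and the overall KMO measure. The function names and the simulated one-factor data are illustrative, not part of any particular package.

```python
import numpy as np
from scipy.stats import chi2

def bartlett_sphericity(X):
    """Bartlett's test that the correlation matrix is the identity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)
    stat = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return stat, chi2.sf(stat, dof)

def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)        # partial correlations given all other variables
    off = ~np.eye(R.shape[0], dtype=bool)    # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + a2)

# Hypothetical data: 5 variables driven by one common factor.
rng = np.random.default_rng(2)
n, p = 500, 5
factor = rng.standard_normal((n, 1))
X = 0.7 * factor + np.sqrt(1 - 0.7**2) * rng.standard_normal((n, p))

stat, p_value = bartlett_sphericity(X)
kmo = kmo_overall(X)
print(f"Bartlett chi2 = {stat:.1f}, p = {p_value:.3g}")
print(f"KMO = {kmo:.2f}")
print(f"n >= 5p: {n >= 5 * p}")
```

KMO compares ordinary correlations with partial correlations: when variables share common factors, partial correlations shrink and KMO moves toward 1.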
The factor loading of 'Anxiety' on F1 = 0.82, on F2 = 0.12. Communality = 0.69. What does this mean?
Rotations: Varimax and Promax
The initial FA solution is not unique: any rotation of the factor space fits the data equally well. **Rotation** turns the axes to maximise interpretability. **Varimax** (orthogonal): factors remain uncorrelated, and loadings are polarised (pushed toward 0 or ±1). **Promax** (oblique): factors may correlate, which is more realistic for psychological constructs.
**Which rotation to choose?** Varimax (orthogonal): when you assume the factors are independent (e.g., speed and accuracy are distinct abilities). Promax (oblique): when the factors may correlate (anxiety and depression are related). In practice: start with Varimax; if the model is hard to interpret, switch to Promax and inspect the factor correlation matrix.
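To make the idea of rotation concrete, here is a minimal NumPy sketch of the commonly used SVD-based iteration for varimax. It is an illustration under stated assumptions, not the implementation of any specific statistics package, and the loading matrix is hypothetical: a clean two-cluster pattern that has been deliberately rotated by 30° so that every variable loads on both factors.

```python
import numpy as np

def varimax(L, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = L.shape
    R = np.eye(k)                 # current rotation matrix
    objective = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        # Gradient-like matrix of the varimax criterion; the SVD step below
        # maps it back to the nearest orthogonal rotation.
        G = L.T @ (Lr**3 - Lr @ np.diag(np.sum(Lr**2, axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt
        new_objective = s.sum()
        if new_objective < objective * (1 + tol):
            break                 # no meaningful improvement: stop
        objective = new_objective
    return L @ R, R

def varimax_criterion(L):
    """Sum over factors of the variance of squared loadings."""
    return np.sum(np.var(L**2, axis=0))

# Hypothetical simple structure, then hidden by a 30-degree rotation.
L_simple = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                     [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
L_unrotated = L_simple @ Q

L_rotated, R = varimax(L_unrotated)
print(np.round(L_unrotated, 2))
print(np.round(L_rotated, 2))
# The reproduced common covariance L L^T is identical before and after rotation.
print(np.allclose(L_unrotated @ L_unrotated.T, L_rotated @ L_rotated.T))
```

Because R is orthogonal, the product LLᵀ is unchanged; only the way the common variance is distributed across factors is different. The rotated loadings may come back with flipped signs or swapped columns, which does not affect interpretation.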
After Varimax rotation you see: 'Vocabulary', 'Reading', 'Grammar' load highly on F1; 'Matrices', 'Figure Rotation', 'Spatial Relations' load highly on F2. How would you name the factors?
Key Ideas
- FA model: X = LF + ε (observed = loadings × factors + uniqueness)
- PCA compresses variance; FA finds latent variables generating correlations
- Loading = correlation between a variable and a factor
- Communality h² = explained variance fraction; uniqueness = 1 − h²
- Varimax (orthogonal rotation): polarises loadings for interpretability
- Promax (oblique): for correlated factors (more realistic in social sciences)
- Kaiser criterion (λ > 1) and scree plot - for choosing the number of factors
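The Kaiser criterion from the last point can be sketched in a few lines: compute the eigenvalues of the correlation matrix and count those above 1. The example assumes NumPy and uses simulated data with two made-up factors; a scree plot is simply these sorted eigenvalues plotted against their rank.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical data: 8 variables, the first four driven by factor 1,
# the last four by factor 2 (illustrative loadings of 0.8).
L = np.zeros((8, 2))
L[:4, 0] = 0.8
L[4:, 1] = 0.8

n = 2000
X = rng.standard_normal((n, 2)) @ L.T + 0.6 * rng.standard_normal((n, 8))

R = np.corrcoef(X, rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]   # largest first, as on a scree plot
n_factors_kaiser = int(np.sum(eigenvalues > 1.0))

print("eigenvalues:", np.round(eigenvalues, 2))
print("factors retained by the Kaiser criterion:", n_factors_kaiser)
```

On real data the gap between 'large' and 'small' eigenvalues is rarely this clean, which is why the Kaiser rule is usually read together with the scree plot rather than on its own.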
FA and Related Methods
FA is related to PCA (both reduce dimensionality), ICA (independent component analysis, a linear model that seeks statistically independent, non-Gaussian components), SEM (structural equation modelling), and LSA (latent semantic analysis of text).
- PCA - FA can be seen as a probabilistic relative of PCA that adds an explicit latent structure model with per-variable noise
- Bayesian Statistics - Bayesian FA allows priors on loadings and the number of factors
Questions for Reflection
- Why does rotation leave model fit (log-likelihood) unchanged yet improve interpretability?
- Take a public psychological dataset (IPIP personality, Big Five). Apply FA with 5 factors and Varimax. Do the five personality traits reproduce?
- Which is better for analysing a questionnaire: FA or PCA? When will the results agree, and when will they diverge?