Statistics

Factor Analysis: Latent Variables

'What is intelligence, really?' - in 1904 Charles Spearman applied factor analysis to test scores and discovered the g-factor. Since then FA has revealed the structure of personality (Big Five), consumer preferences, and portfolio risk. FA is an X-ray of hidden reality.

  • Psychometrics: the Big Five personality traits (Openness, Conscientiousness, Extraversion, Agreeableness, Neuroticism) - the result of FA on thousands of adjectives
  • Marketing: hidden buyer motivations behind questionnaire responses
  • Genomics: haplogroups as latent factors of SNP markers
  • Finance: factor models of returns (Fama-French 5-factor model)
  • Neuroscience: independent components of fMRI signals (ICA)

Prerequisites

  • Principal Component Analysis (PCA)

Latent Variables: What Lies Behind the Data

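Word2Vec and BERT embeddings, like PCA, rest on the same idea as factor analysis: a small number of latent dimensions explains many observed variables. OpenAI's ada-002, for example, represents each text as a vector of 1536 latent dimensions. The FA model is X = LF + ε, where L is the loading matrix, F are the latent factors (standard normal, N(0,1)), and ε is noise unique to each variable. PCA compresses variance; FA models the latent structure that generates the correlations.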
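The model X = LF + ε implies a covariance of the form Σ = LLᵀ + Ψ, where Ψ is the diagonal matrix of unique variances. The sketch below checks this by simulation using only the standard library. The loading values, sample size, and seed are illustrative assumptions, not taken from any dataset.

```python
import random

# Hypothetical loadings: 4 observed variables, 2 latent factors.
L = [
    [0.8, 0.1],
    [0.7, 0.2],
    [0.1, 0.9],
    [0.2, 0.6],
]
# Unique variances chosen so that every variable has unit variance.
psi = [1.0 - sum(l * l for l in row) for row in L]

def implied_cov(L, psi):
    """Model-implied covariance: Sigma = L L^T + diag(psi)."""
    p = len(L)
    return [
        [sum(a * b for a, b in zip(L[i], L[j])) + (psi[i] if i == j else 0.0)
         for j in range(p)]
        for i in range(p)
    ]

def simulate(L, psi, n, seed=0):
    """Draw n observations x = L f + eps, f ~ N(0, I), eps_i ~ N(0, psi_i)."""
    rng = random.Random(seed)
    k = len(L[0])
    rows = []
    for _ in range(n):
        f = [rng.gauss(0.0, 1.0) for _ in range(k)]
        rows.append([
            sum(l * fj for l, fj in zip(row, f)) + rng.gauss(0.0, s ** 0.5)
            for row, s in zip(L, psi)
        ])
    return rows

def sample_cov(rows):
    """Unbiased sample covariance matrix of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[i] for r in rows) / n for i in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]

sigma = implied_cov(L, psi)
S = sample_cov(simulate(L, psi, n=20000))
max_gap = max(abs(S[i][j] - sigma[i][j]) for i in range(4) for j in range(4))
print(f"largest |sample - implied| covariance entry: {max_gap:.3f}")
```

With a large sample the gap should be small, which is the sense in which the factors "generate" the observed correlations.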

| Aspect | PCA | Factor Analysis |
| --- | --- | --- |
| Goal | Compress variance | Model latent structure |
| Model | X = PC (deterministic) | X = LF + ε (probabilistic) |
| Uniqueness | None (all explained by PCs) | Present (ε - specific noise) |
| Interpretability | Components = mathematical constructs | Factors = meaningful concepts |
| Rotation | Optional | Key tool for interpretation |
| Application | Dimensionality reduction | Psychometrics, surveys, genomics |

**History:** Factor analysis was developed by psychologist Charles Spearman in 1904 to study intelligence. The g-factor (general intelligence) is the first and most famous application of FA. Today FA is used in psychometrics, marketing (hidden buyer motivations), neuroscience, and genomics.

A researcher wants to understand which latent personality traits underlie responses to 50 questionnaire items. Which method is more appropriate?

Factor Loadings and Uniqueness

A **factor loading** is the correlation between an observed variable and a latent factor (this holds for standardised variables and orthogonal factors). A high loading (|l| > 0.5) means the variable is a strong indicator of that factor. **Communality** h² is the fraction of the variable's variance explained by all factors together: h² = Σl², the sum of the variable's squared loadings across factors. **Uniqueness** ψ = 1 − h² is the part of the variance specific to that variable.
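As a small worked example of these formulas, the sketch below computes communality and uniqueness row by row. The variable names and loading values are hypothetical, and the factors are assumed to be orthogonal.

```python
# Hypothetical loadings on two orthogonal factors (one row per variable).
loadings = {
    "Anxiety": [0.82, 0.12],
    "Worry": [0.75, 0.20],
    "Sociability": [0.10, 0.70],
}

def communality(row):
    """h^2 = sum of squared loadings across factors."""
    return sum(l * l for l in row)

for name, row in loadings.items():
    h2 = communality(row)
    print(f"{name:12s} h2 = {h2:.2f}  uniqueness = {1 - h2:.2f}")
```

For the first row, 0.82² + 0.12² = 0.6868, so about 69% of that variable's variance is shared with the factors and about 31% is unique.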

**FA prerequisites:**

  1. KMO > 0.6 (sampling adequacy)
  2. Bartlett's test significant (p < 0.05) - non-trivial correlations exist
  3. n ≥ 5 × p (at least 5 observations per variable, ideally 10×)
  4. Metric data (or polytomous ordinal). For binary data, use ordinal FA or IRT.
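Two of these checks are easy to compute by hand. The sketch below implements Bartlett's test of sphericity, χ² = −(n − 1 − (2p + 5)/6) · ln|R| with p(p − 1)/2 degrees of freedom, and the n ≥ 5p rule. The correlation matrix and sample size are hypothetical, and the result is compared with a tabulated critical value rather than converted to a p-value, so only the standard library is needed.

```python
import math

def det(m):
    """Determinant by Laplace expansion along the first row (fine for small p)."""
    if len(m) == 1:
        return m[0][0]
    total = 0.0
    for j, a in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * a * det(minor)
    return total

def bartlett_sphericity(R, n):
    """Return (chi-square statistic, degrees of freedom) for H0: R = identity."""
    p = len(R)
    stat = -(n - 1 - (2 * p + 5) / 6) * math.log(det(R))
    df = p * (p - 1) // 2
    return stat, df

# Hypothetical correlation matrix of 3 variables from n = 100 respondents.
R = [
    [1.0, 0.6, 0.5],
    [0.6, 1.0, 0.4],
    [0.5, 0.4, 1.0],
]
n = 100
stat, df = bartlett_sphericity(R, n)

CHI2_CRIT_DF3 = 7.815  # 0.95 quantile of chi-square with 3 degrees of freedom
print(f"chi2 = {stat:.1f}, df = {df}, reject H0 at 5%: {stat > CHI2_CRIT_DF3}")
print(f"n >= 5p rule satisfied: {n >= 5 * len(R)}")
```

Rejecting H0 here only says the correlations are not all zero; it does not by itself show that a factor model fits well.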

The factor loading of 'Anxiety' on F1 = 0.82, on F2 = 0.12. Communality = 0.69. What does this mean?

Rotations: Varimax and Promax

The initial FA solution is not unique: any rotation of the factor space fits the data equally well. **Rotation** turns the axes to maximise interpretability. **Varimax** (orthogonal) - factors remain uncorrelated; loadings are polarised (pushed toward 0 or ±1). **Promax** (oblique) - factors may correlate, which is more realistic for psychological constructs.
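The sketch below illustrates both points for a two-factor case: an orthogonal rotation leaves the reproduced correlations LLᵀ unchanged, while the varimax criterion (the variance of squared loadings, summed over factors) can increase a lot. The loadings are hypothetical, and the one-degree grid search is a simple stand-in for the iterative varimax algorithm used in practice (it also omits Kaiser normalisation).

```python
import math

# Hypothetical unrotated loadings: 4 variables on 2 factors.
L = [
    [0.60, 0.55],
    [0.55, 0.50],
    [0.60, -0.50],
    [0.50, -0.55],
]

def rotate(L, theta):
    """Apply an orthogonal rotation by angle theta to two-factor loadings."""
    c, s = math.cos(theta), math.sin(theta)
    return [[a * c + b * s, -a * s + b * c] for a, b in L]

def varimax_criterion(L):
    """Sum over factors of the variance of squared loadings (raw varimax)."""
    p = len(L)
    total = 0.0
    for j in range(len(L[0])):
        sq = [row[j] ** 2 for row in L]
        mean = sum(sq) / p
        total += sum((x - mean) ** 2 for x in sq) / p
    return total

def reproduced(L):
    """Common part of the correlation matrix, L L^T."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in L] for ri in L]

# Crude grid search over rotation angles in one-degree steps.
angles = [math.radians(d) for d in range(90)]
best = max(angles, key=lambda t: varimax_criterion(rotate(L, t)))
L_rot = rotate(L, best)

print("criterion before:", round(varimax_criterion(L), 4))
print("criterion after: ", round(varimax_criterion(L_rot), 4))
for row in L_rot:
    print([round(x, 2) for x in row])
```

After rotation each variable should load mainly on one factor. The sign of a whole column is arbitrary, so a factor with all-negative loadings can be flipped without changing the model.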

**Which rotation to choose?** Varimax (orthogonal): when you assume the factors are independent (e.g., speed and accuracy are distinct abilities). Promax (oblique): when factors may correlate (anxiety and depression are related). In practice: start with Varimax; if the model is hard to interpret, switch to Promax and inspect the factor correlation matrix.

After Varimax rotation you see: 'Vocabulary', 'Reading', 'Grammar' load highly on F1; 'Matrices', 'Figure Rotation', 'Spatial Relations' load highly on F2. How would you name the factors?

Key Ideas

  • FA model: X = LF + ε (observed = loadings × factors + uniqueness)
  • PCA compresses variance; FA finds latent variables generating correlations
  • Loading = correlation between a variable and a factor
  • Communality h² = explained variance fraction; uniqueness = 1 − h²
  • Varimax (orthogonal rotation): polarises loadings for interpretability
  • Promax (oblique): for correlated factors (more realistic in social sciences)
  • Kaiser criterion (λ > 1) and scree plot - for choosing the number of factors
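The Kaiser criterion from the last bullet can be sketched as follows: compute the eigenvalues of the correlation matrix and count those above 1. The matrix is a hypothetical example with two blocks of three correlated variables, and NumPy is assumed to be installed.

```python
import numpy as np

# Hypothetical correlation matrix: within-block r = 0.6, between-block r = 0.1.
within = np.full((3, 3), 0.6)
between = np.full((3, 3), 0.1)
R = np.block([[within, between], [between, within]])
np.fill_diagonal(R, 1.0)

# eigvalsh returns eigenvalues of a symmetric matrix in ascending order.
eigvals = np.linalg.eigvalsh(R)[::-1]
n_factors = int((eigvals > 1.0).sum())

print("eigenvalues:", np.round(eigvals, 2))
print("factors retained by the Kaiser criterion:", n_factors)
```

Plotting the same eigenvalues in descending order gives the scree plot; the "elbow" after the second value points to the same two-factor answer in this constructed example.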

FA and Related Methods

FA is related to PCA (both reduce dimensionality), ICA (independent component analysis - a linear model that seeks statistically independent, non-Gaussian sources), SEM (structural equation modelling), and LSA (latent semantic analysis of text).

  • PCA - FA adds a separate noise variance for each variable to a PCA-like linear model; probabilistic PCA is the special case where all noise variances are equal
  • Bayesian Statistics - Bayesian FA allows priors on loadings and the number of factors

Questions for Reflection

  • Why does rotation leave model fit (log-likelihood) unchanged while improving interpretability?
  • Take a public psychological dataset (IPIP personality, Big Five). Apply FA with 5 factors and Varimax. Do the five personality traits reproduce?
  • Which is better for analysing a questionnaire: FA or PCA? When will the results agree, and when will they diverge?

Related Lessons

  • la-15-svd