Statistics
Principal Component Analysis (PCA)
'50,000 genes need population structure analysis' or '1,000 user metrics - what matters?' People can visualise at most three dimensions. PCA compresses thousands of features into 2-3 components that retain as much of the variance as possible, making otherwise invisible structure visible.
- Genomics: human population structure from millions of SNP markers visualised in seconds
- Face recognition: eigenfaces - the first PCA components of face images
- Finance: principal components of stock returns = hidden risk factors of the market
- Netflix: dimensionality reduction on the user-movie matrix for recommendation systems
- MRI: data volume compression while retaining diagnostic information
Prerequisites
The Curse of Dimensionality
The **curse of dimensionality** is the phenomenon where, as the number of features grows, data becomes sparse in the high-dimensional space, distances between points become nearly indistinguishable, and models overfit. For example, splitting each of 100 features into just 10 bins gives a grid of 10^100 cells; placing a single observation in even 1% of them would require 10^98 observations, which is impossible. We need dimensionality reduction: discard noise, keep signal.
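The claim that distances lose meaning can be checked numerically. The sketch below is an illustration only: it assumes points drawn uniformly from the unit hypercube, and the helper name `relative_contrast` is hypothetical. It measures how much farther the farthest point is from a query point than the nearest one.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """Return (max - min) / min distance from a random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# As the dimension grows, nearest and farthest neighbours look alike.
for d in (2, 10, 100, 1000):
    print(f"{d:>5} dims: relative contrast = {relative_contrast(500, d):.2f}")
```

Under these assumptions the contrast shrinks towards zero as the dimension grows, which is why nearest-neighbour style reasoning degrades in high dimensions.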
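**Multicollinearity and PCA:** when features are highly correlated (e.g., height and weight), information is duplicated. PCA finds uncorrelated directions of maximum variance, so keeping only the leading components removes this redundancy and, when the noise lies mostly in low-variance directions, much of the noise as well.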
A dataset has 500 observations and 300 features. Only 20 features are genuinely informative. What happens to an ML model without preprocessing?
The PCA Algorithm: Eigenvectors of the Covariance Matrix
**PCA (Principal Component Analysis)** finds orthogonal axes (principal components) along which data variance is maximised. Steps:
1. Centre the data (subtract the mean).
2. Compute the covariance matrix Σ.
3. Find the eigenvectors and eigenvalues of Σ.
4. Sort the eigenvectors by descending eigenvalue.
5. Project the data onto the top-k eigenvectors.
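The five steps can be sketched directly in NumPy. This is a minimal illustration, not a production implementation: the function name `pca` and the synthetic data are assumptions made for the example, and library implementations typically use an SVD of the centred data instead of an explicit covariance matrix.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples x n_features) onto its top-k principal components."""
    X_centred = X - X.mean(axis=0)                  # 1. centre the data
    cov = np.cov(X_centred, rowvar=False)           # 2. covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)          # 3. eigen-decomposition (symmetric)
    order = np.argsort(eigvals)[::-1]               # 4. sort by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]
    scores = X_centred @ components                 # 5. project onto top-k eigenvectors
    return scores, components, eigvals

# Synthetic data (hypothetical): two correlated features plus one noise feature.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 1))
X = np.hstack([latent, 2 * latent, rng.normal(size=(200, 1))])
X[:, :2] += rng.normal(scale=0.1, size=(200, 2))

scores, components, eigvals = pca(X, k=2)
print("explained variance ratio:", np.round(eigvals / eigvals.sum(), 3))
print("covariance of scores:\n", np.round(np.cov(scores, rowvar=False), 3))
```

The covariance of the projected scores is diagonal: the components are uncorrelated, and each one's variance equals the corresponding eigenvalue.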
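**Scaling matters!** PCA maximises variance, so features with large numerical ranges dominate the leading components. When features are measured in different units or ranges, apply `StandardScaler` before PCA: `Pipeline([('scaler', StandardScaler()), ('pca', PCA(n_components=20))])`. If all features share the same units and their relative variances are meaningful, PCA on the centred but unscaled data can be the better choice.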
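The effect of scaling can be seen on a small synthetic example. The sketch below uses plain NumPy and standardises by hand, which is what `StandardScaler` does by default (zero mean, unit variance per feature). The features `income` and `age` and their distributions are hypothetical.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance captured by each principal component of X."""
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]  # descending order
    return eigvals / eigvals.sum()

rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, size=1_000)  # hypothetical feature, large range
age = rng.normal(40, 10, size=1_000)             # hypothetical feature, small range
X = np.column_stack([income, age])

# Standardise each column to zero mean and unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

raw_ratio = explained_variance_ratio(X)
scaled_ratio = explained_variance_ratio(X_scaled)
print("unscaled:", np.round(raw_ratio, 4))
print("scaled:  ", np.round(scaled_ratio, 4))
```

Without scaling, the first component is essentially the `income` axis and appears to explain almost all of the variance. After scaling, the two independent features contribute roughly equally.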
Data: height (cm) ranging from 150-200 and weight (kg) ranging from 50-120. PCA is applied without scaling. What happens?
Choosing the Number of Components: Explained Variance
The key question in PCA: how many components to keep? The **scree plot** shows eigenvalues in descending order - look for the 'elbow' where the decline flattens. **90-95% rule:** keep enough components to explain 90-95% of the total variance. **Task context:** for visualisation - 2-3 components; for ML preprocessing - 90-95%.
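The cumulative-variance rule is simple to compute once the explained variance ratios are known. In this sketch the helper name `components_for_variance` and the ratios are made up for illustration; in practice the ratios would come from a fitted PCA.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold):
    """Smallest k whose cumulative explained variance reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))  # guard against thresholds above the total

# Hypothetical explained variance ratios, already sorted in descending order.
ratios = np.array([0.50, 0.20, 0.15, 0.08, 0.04, 0.03])

print(np.round(np.cumsum(ratios), 2))         # cumulative curve to inspect
print(components_for_variance(ratios, 0.90))  # 4
print(components_for_variance(ratios, 0.95))  # 5
```

Plotting `ratios` against the component index gives the scree plot; plotting the cumulative sum shows where the 90-95% band is crossed.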
**PCA applications:** image compression (eigenfaces in face recognition), genomics (population structure from SNP markers), visualisation of high-dimensional data, ML preprocessing (noise reduction, faster training), financial modelling (portfolio risk factors).
A scree plot shows: PC1 = 45%, PC2 = 30%, PC3 = 10%, PC4 = 8%, and <2% each thereafter. How many components should be kept for visualisation vs. ML preprocessing?
Key Ideas
- PCA finds orthogonal axes of maximum variance (principal components)
- Algorithm: centring → covariance matrix → eigenvectors → sort by eigenvalue → project onto the top k
- Scale data (StandardScaler) before applying PCA whenever features differ in units or range
- Scree plot and cumulative explained variance are the tools for choosing k
- Rule of thumb: 90-95% variance for ML preprocessing, 2-3 components for visualisation
- PCA is compression without a latent structure model (unlike Factor Analysis)
PCA and Related Methods
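PCA is the foundational linear dimensionality reduction method. Factor Analysis (FA) seeks latent variables that generate the data. t-SNE and UMAP are non-linear alternatives for visualisation. Autoencoders generalise PCA with neural networks: a linear autoencoder trained with squared error learns the same subspace as PCA, while non-linear ones can capture curved structure.
- Factor Analysis: a related method that models latent variables explicitly, often with rotation
- Correlation Analysis: PCA on the correlation matrix is equivalent to PCA on standardised features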
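The link between the correlation matrix and scaling can be verified numerically. This is a small check on synthetic data (the data and variable names are assumptions for illustration): the eigenvalues of the correlation matrix of the raw features match those of the covariance matrix of the standardised features.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: three features on very different scales.
X = rng.normal(size=(300, 3)) * np.array([1.0, 10.0, 100.0])
X[:, 1] += 5 * X[:, 0]  # make two features correlated

# Standardise with the sample standard deviation (ddof=1) to match np.cov.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eig_corr = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
eig_scaled = np.linalg.eigvalsh(np.cov(Z, rowvar=False))

print(np.round(eig_corr, 4))
print(np.round(eig_scaled, 4))
```

Both lines print the same spectrum, so the resulting principal components and explained variance ratios coincide.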
Questions for Reflection
- Why are principal components orthogonal? What does that mean in terms of correlation between them?
- Take a dataset with numeric features (e.g., Boston Housing or Iris). How many components are needed for 90% variance? What do the first 2 components represent?
- PCA reduces dimensionality and can improve an ML model. But when would PCA hurt classification performance?