Machine Learning

PCA and Dimensionality Reduction

A dataset with 10,000 features. The model trains for hours, overfits, and you can't visualize the data - how do you understand what's happening in a 10,000-dimensional space? It turns out that most of these features duplicate each other, and the real structure of the data lives in a much lower-dimensional space. PCA, t-SNE and UMAP are three tools that find this hidden structure and compress the data without significant information loss.

  • **Genomics:** expression of 20,000 genes per cell. PCA compresses to 50 components for cell type clustering, and UMAP visualizes millions of cells in 2D - this is how new subtypes of cancer cells were discovered
  • **Recommendation systems:** Netflix has millions of users and thousands of movies. PCA (via SVD) compresses the rating matrix to 100-200 latent factors - "likes action", "prefers European cinema" - and builds recommendations on them
  • **NLP:** word embeddings (Word2Vec, GloVe) live in 300-dimensional space. t-SNE and UMAP visualize them in 2D, showing semantic clusters: countries near capitals, verbs near verbs

Lines of closest fit, then principal components

PCA was invented twice, from two different angles. In 1901 Karl Pearson, the British statistician better known for the correlation coefficient, published "On Lines and Planes of Closest Fit to Systems of Points in Space". His question was geometric: which line or plane minimizes the perpendicular distance to a cloud of points? The answer, it turned out, is exactly the direction of maximum variance, the first principal axis. Pearson lacked the computing tools to use the method on real data, so it stayed mostly a curiosity. Three decades later, in 1933, the American economist Harold Hotelling rederived the technique from a statistical standpoint, asking how to summarize many correlated variables with a few uncorrelated ones. It was Hotelling who named them "principal components" and connected the whole construction to the eigenvectors of the covariance matrix, the form taught today. Two researchers, geometry and statistics, the same axes of maximum spread.

Предварительные знания

  • Math for ML

Eigenvalues and Principal Components

A dataset with 200 features: height, weight, blood pressure, blood sugar, cholesterol... Training a model on 200 features is expensive, many correlate with each other (height and leg length, weight and waist size), and visualizing a 200-dimensional space is impossible. **PCA (Principal Component Analysis)** solves this: it finds new coordinate axes along which the data is stretched maximally, and discards axes where the data barely varies.

The key idea of PCA: if a cloud of points in 2D is elongated at a 45-degree angle, the axis of maximum variance coincides with neither x nor y. PCA finds a **new coordinate system** where the first axis (PC1) runs along the maximum "stretch" of the data, the second (PC2) is perpendicular to the first and captures the maximum of the remaining variance, and so on. These new axes are called **principal components**.

Mathematically, PCA works through the **covariance matrix**. The covariance matrix shows how each pair of features is related. Diagonal elements are the variance of each feature, off-diagonal elements are the covariance between pairs. PCA finds the **eigenvectors** of this matrix - they define the directions of the principal components. And **eigenvalues** show how much variance falls on each direction. The larger the eigenvalue - the more important the component.

**Standardization is mandatory before PCA!** If features are on different scales (height in cm: 150-200, salary in dollars: 30000-300000), PCA will "think" salary is more important simply because the numbers are larger. StandardScaler brings all features to mean 0 and variance 1 - after that PCA compares directions fairly.

Why is it necessary to standardize data before PCA?

Explained Variance: How Much Information is Retained

We've learned to project data onto principal components, but the key question remains: **how many components should we keep?** If the data is 100-dimensional and we take 2 components - we might lose 80% of the information. Or 2 components might retain 95%. The answer is given by the **explained variance ratio** - the fraction of total variance explained by each component.

The eigenvalue of each component shows how much variance it captures. **Explained variance ratio** is the eigenvalue divided by the sum of all eigenvalues. If we have 4 components with eigenvalues [2.93, 0.93, 0.14, 0.02], then PC1 explains 2.93 / (2.93 + 0.93 + 0.14 + 0.02) = **72.8%** of variance. PC2 adds another **23.0%**. Together - **95.8%**. So 2 components out of 4 retain almost all the information.

**Rules for choosing n_components:** - **Visualization:** n_components=2 or 3 (for plots) - **Variance threshold:** n_components=0.95 - sklearn will automatically choose the minimum number of components retaining 95% - **Elbow method:** build a scree plot, look for the "elbow" - the point after which variance gain drops sharply - **Practice:** for preprocessing before ML usually 95-99%; for visualization - 2-3 components

A real-world example: the MNIST dataset (handwritten digits) has 784 features (28x28 pixels). PCA with a 95% threshold compresses them to **~150 components** - 5x fewer. A classifier on 150 PCA features works almost as accurately as on 784, but trains much faster. For visualization (2 components) - clusters of digits are already directly distinguishable, even though only ~25% of variance is retained.

A dataset has 50 features. PCA shows cumulative explained variance: PC1=40%, PC1+PC2=65%, PC1-PC5=88%, PC1-PC10=96%. How many components should be chosen for a 95% threshold?

t-SNE: Visualization in 2D

PCA excels at compressing data for preprocessing, but it has a fundamental limitation: **PCA is a linear method**. It projects data onto a plane, and if clusters in the original space are separated non-linearly (e.g., one cluster surrounds another), PCA will mix them. For **visualizing** complex multidimensional structures, non-linear methods were developed, and the most popular is **t-SNE** (t-distributed Stochastic Neighbor Embedding, van der Maaten & Hinton, 2008).

The idea of t-SNE: preserve the **local structure** of the data. If points A and B are close neighbors in 100-dimensional space, they should remain neighbors on the 2D plot. t-SNE builds a probability distribution of "who is whose neighbor" in the original space and tries to reproduce the same distribution in 2D. Close points attract each other, distant ones repel. The result is clear clusters where the data structure is visible to the naked eye.

**Perplexity - the key hyperparameter of t-SNE:** - Intuition: "how many nearest neighbors to consider" when building probabilities - Small perplexity (5-10): focus on very local structure, small dense clusters - Large perplexity (30-50): considers more neighbors, more global picture - Rule: perplexity must be less than the number of points. For 1000 points: 5-50, for 100: 5-20 - Recommended to try several values (5, 15, 30, 50) and compare results

**Limitations of t-SNE - why it can't be used for preprocessing:** - **Instability:** different runs give different results (depends on random_state) - **No transform():** can't project new points - only fit_transform() on all data at once - **Distances on the plot are misleading:** distance between clusters doesn't reflect the real distance in the original space - **Slow:** O(n^2) in time and memory, on 100K+ points takes minutes to hours - **Only 2-3D:** designed exclusively for visualization, not for compression

Why isn't t-SNE suitable for preprocessing data before training an ML model?

UMAP and Practical Tips

In 2018 Leland McInnes introduced **UMAP (Uniform Manifold Approximation and Projection)** - an algorithm that solves the main problems of t-SNE. UMAP is significantly faster (from O(n^2) in t-SNE to approximately O(n*log(n)) in UMAP), better preserves the **global structure** of the data (distances between clusters are meaningful) and - critically - has a **transform()** method for new data.

**UMAP hyperparameters:** - **n_neighbors** (default=15): analogous to perplexity in t-SNE. Small value (5) - focus on micro-structure, large value (50) - on macro-structure. Start with 15. - **min_dist** (default=0.1): minimum distance between points on the 2D plot. min_dist=0.0 - dense clusters, min_dist=0.5 - sparse, "spread out" clusters. - **metric** (default='euclidean'): distance metric. Can use 'cosine', 'manhattan' and others - more flexible than PCA.

Another application of PCA: **denoising**. The idea is simple - noise is usually random and spread across all directions, while the useful signal is concentrated in a few principal components. Projecting data onto the first k components and reconstructing back to the original space removes noise while preserving the signal. This works because the last components (with small eigenvalues) contain mostly noise.

PCA, t-SNE and UMAP do the same thing - reduce dimensionality - and any of them can be used for any task

Each method has its own niche: PCA - for preprocessing and denoising (linear, stable, with transform()), t-SNE - for visualizing small datasets (non-linear, but slow and without transform()), UMAP - for visualizing any datasets and non-linear preprocessing (fast, with transform())

PCA is linear and doesn't separate non-linear clusters when visualizing. t-SNE is unstable and has no transform() - it can't be applied to test data separately. UMAP is a compromise but less interpretable than PCA. The choice of method depends on the task: preprocessing, visualization, or denoising

You have a dataset of 500K points and 300 features. You need to visualize clusters in 2D. Which method is best?

Key Ideas

  • **PCA** finds the directions of maximum variance (eigenvectors of the covariance matrix) and projects data onto them. Eigenvalues show the importance of each direction. Standardization is mandatory
  • **Explained variance ratio** determines how much information each component retains. The 95% threshold or the elbow method help choose the optimal number of components
  • **t-SNE** is a non-linear method for visualization: separates clusters well, but is slow (O(n^2)), unstable, and has no transform(). Only for visualization, not for preprocessing
  • **UMAP** is a fast alternative to t-SNE with transform() and preserved global structure. PCA for preprocessing, UMAP for large dataset visualization - this is exactly how genomics visualizes millions of cells and NLP shows the structure of word embeddings, as mentioned in the intro

Related Topics

Dimensionality reduction is closely related to the math basics of ML, feature space methods, and data preparation:

  • Mathematical Foundations of ML — PCA relies on linear algebra - eigenvectors and values, matrix decompositions. These concepts are covered in detail in the math foundations lesson
  • k-NN (k Nearest Neighbors) — k-NN suffers from the curse of dimensionality - in high-dimensional space all points are equidistant. PCA reduces dimensionality and makes k-NN effective
  • K-Means Clustering — PCA + K-Means is a classic combination: first PCA reduces dimensionality and removes noise, then K-Means finds clusters in the compressed space
  • Feature Engineering — PCA is one method of automatic feature engineering. It creates new features (principal components) as linear combinations of the original ones

Вопросы для размышления

  • PCA finds linear combinations of features. What happens if the real data structure is non-linear - for example, points lie on the surface of a sphere? Can PCA correctly reduce the dimensionality?
  • t-SNE visualizes clusters well, but distances between clusters on the plot have no meaning. How could this mislead data analysis? Give a specific example.
  • If PCA for denoising discards the last components as 'noisy', what happens if those components contain an important but weak signal - for example, a rare anomaly in the data?

Связанные уроки

  • ml-03-math-foundations — PCA rests on eigenvalues and SVD from the math basics
  • ml-42-feature-engineering — Reduced components feed cleaner features downstream
  • ml-16-clustering-kmeans — Project to fewer dimensions before clustering
  • la-15-svd — SVD is the direct algebraic engine behind PCA
  • stat-14-pca — Statistics frames PCA via covariance and variance
PCA and Dimensionality Reduction

0

1

Sign In