Machine Learning

Hierarchical Clustering

Suppose you're a biologist and want to classify 200 butterfly species by wing shape. How many groups should you define? 5? 10? 30? Or maybe butterflies group on multiple levels - species within genera, genera within families? K-Means demands an answer upfront, but hierarchical clustering builds a complete lineage tree and lets you choose the level of detail afterwards, by looking at the data.

  • **Phylogenetics:** biologists build evolutionary lineage trees of species using hierarchical clustering on DNA sequences - this is exactly how the "last universal common ancestor" of all living organisms (LUCA) was reconstructed
  • **Customer segmentation:** marketers use dendrograms for multi-level segmentation: at the top level "premium" and "mass market", within each - micro-segments by behavior, enabling communication to be tuned at the desired level of detail
  • **Gene clustering:** in bioinformatics hierarchical clustering with heatmaps is the standard tool for gene expression analysis - which genes activate together, which gene groups are associated with specific diseases

From numerical taxonomy to Ward's method

Hierarchical clustering came of age in biology, where the urge to build trees of life is centuries old. The turning point was 1963, when Robert Sokal and Peter Sneath published "Principles of Numerical Taxonomy", arguing that species should be grouped by measurable similarity computed from many traits at once, rather than by an expert's intuition. Their book gave biologists concrete algorithms for merging the most similar groups step by step, and it pushed clustering from a vague idea into a repeatable numerical procedure. The same year, statistician Joe Ward proposed a different merging rule in his paper on hierarchical grouping: at each step, join the two clusters whose union increases the total within-cluster variance the least. Ward's method remains the default choice in most software today, and the dendrograms that numerical taxonomy popularized now appear far beyond biology, in marketing, genomics, and document analysis.

Предварительные знания

  • K-Means Clustering

Agglomerative: Bottom-Up

In the previous lesson we studied K-Means, where you need to specify the number of clusters K upfront. But what if you don't know how many groups are in the data? Hierarchical clustering solves this problem: it builds a **complete tree of nested clusters** - from individual points to a single all-encompassing cluster - and allows you to choose the desired "cut" level after training.

**Agglomerative clustering** (from Latin agglomerare - "to gather into a heap") is the most popular variant of hierarchical clustering. It works **bottom-up**: starts with N individual clusters (each point is its own cluster) and at each step merges the two nearest clusters into one. The process repeats until a single cluster containing all points remains.

The key advantage: **the decision on the number of clusters can be deferred**. The algorithm builds the complete merge hierarchy, and you choose the cut level after the fact by analyzing the data structure. With K-Means, a wrong choice of K requires a complete restart of the algorithm.

**Algorithm complexity:** - **Time:** O(n^3) in the basic implementation (at each of n steps we find the minimum in an n x n distance matrix). With optimizations (min-heap) O(n^2 log n) is achievable. - **Memory:** O(n^2) to store the pairwise distance matrix. For comparison: K-Means runs in O(n * K * i), where i is the number of iterations (usually 10-50). So for large datasets (> 10,000 points) K-Means is significantly faster.

In practice, agglomerative clustering is used when the data size is moderate (up to ~10,000 points) and a **hierarchical structure** is specifically important - for example, species taxonomy in biology, customer segmentation by tiers, or grouping documents by topics with sub-topics.

Where does agglomerative clustering start?

Divisive: Top-Down

If agglomerative goes bottom-up, then **divisive clustering** works in the opposite direction - **top-down**. We start with one cluster containing all points and at each step split the least "cohesive" cluster into two. The process continues until every point is in its own cluster.

The most well-known divisive algorithm is **DIANA (DIvisive ANAlysis)**, proposed by Kaufman and Rousseeuw in 1990. At each step DIANA selects the cluster with the largest diameter (maximum distance between two points within the cluster) and splits it, trying to minimize the mean distance within the resulting sub-clusters.

The main problem with the divisive approach is the **computational cost of optimal splitting**. To split a cluster of m points into two optimally, you'd need to evaluate 2^(m-1) - 1 possible splits. For a cluster of 20 points that's already more than 500,000 options. Therefore DIANA uses a greedy heuristic rather than exhaustive search, which doesn't guarantee the optimal result.

In practice **agglomerative is used in 95%+ of cases**. Divisive can be useful if you know the data has a clear top-down hierarchical structure (e.g., splitting into "developed" and "developing" countries, then further subdivisions within each). But even in such cases agglomerative usually gives comparable results with lower computational cost.

**Key difference from K-Means:** both hierarchical clustering approaches - agglomerative and divisive - are **deterministic**. Given the same data they always produce the same result. K-Means, on the other hand, depends on the random initialization of centroids and can give different results on different runs.

Why is divisive clustering used significantly less often than agglomerative?

Dendrogram: Visualizing the Hierarchy

Hierarchical clustering produces not just a set of clusters, but a **complete merge tree**. To visualize this tree a **dendrogram** is used (from Greek dendron - "tree" and gramma - "record"). It's a diagram where each merge (or split) is shown as a connection of two branches, and the **height of the connection** represents the distance at which the merge occurred.

The dendrogram is the main tool for choosing the number of clusters. Look for a **large height jump** between consecutive merges: this means the merged clusters were far apart, i.e., a "natural boundary" of the data. A horizontal cut line at that level gives a meaningful partition.

**What is stored in the linkage matrix Z?** Each row of Z contains 4 values: - **Z[i, 0]** and **Z[i, 1]** - indices of the merged clusters - **Z[i, 2]** - distance between them at the time of merging - **Z[i, 3]** - number of points in the new cluster Indices 0..n-1 are the original points, indices n, n+1, ... are the new clusters formed by merges. This is a compact representation of the entire tree.

The dendrogram provides information that K-Means cannot: a **multi-level structure of the data**. For example, in biology species are grouped into genera, genera into families, families into orders - and the dendrogram shows this entire hierarchy in a single diagram. In marketing, customers are grouped into micro-segments, which aggregate into macro-segments, which form strategic groups.

**Large data trap:** a dendrogram with 10,000+ points becomes unreadable - the leaves merge into a solid bar. Solutions: 1. subsample down to 500-1000 points 2. use `truncate_mode='level'` in scipy to show only the top levels 3. compute clusters programmatically via `fcluster()` without visualization.

On a dendrogram two clusters merge at height 8.5, while all previous merges happened below height 3.0. What does this mean?

Linkage Methods

The key question in agglomerative clustering: how to measure the **distance between two clusters** when each contains multiple points? The answer - the **linkage method** - directly affects the shape of the resulting clusters. The same dataset with different linkage methods can yield completely different results.

**Single linkage** measures the distance between the two nearest points from different clusters. This leads to the **"chaining effect"**: if there is a "bridge" of a few intermediate points between two clouds of points, single linkage will merge those clouds into one cluster by extending a chain through the bridge. This is useful for detecting clusters of arbitrary shape (e.g., two crescents), but works poorly with noisy data.

**Complete linkage** measures the distance between the two farthest points from different clusters. This gives **compact, roughly spherical clusters** and is robust to noise, but works poorly with elongated shapes. **Average linkage** takes the mean of all pairwise distances - a compromise between single and complete.

**How to choose the linkage method?** - **Ward** - safe default choice. Produces compact clusters of roughly equal size. Works like a "hierarchical K-Means". - **Complete** - when compact clusters are needed but Ward doesn't fit (e.g., a non-Euclidean distance metric is required). - **Single** - when clusters have complex shapes (chains, crescents, rings). But sensitive to noise. - **Average** - the middle ground when you're not sure.

Hierarchical clustering is always better than K-Means because it doesn't require specifying K in advance and provides a dendrogram

Hierarchical clustering and K-Means solve different tasks: hierarchical provides deep understanding of data structure on small samples, while K-Means scales to millions of points

For n > 10,000 points hierarchical clustering requires O(n^2) memory and O(n^2 log n) time, making it impractical. K-Means with O(n*K) memory and linear time remains the primary tool for large data. The choice of method depends on the task: need hierarchy and interpretability - hierarchical; need speed on large data - K-Means

Which linkage method is best for detecting clusters shaped like crescents or chains?

Key Ideas

  • **Agglomerative (bottom-up)** - the primary method: starts with N clusters, at each step merges the two nearest; no need to specify K in advance
  • **Divisive (top-down)** - the reverse approach: starts with one cluster, sequentially splits it; rarely used due to exponential complexity of optimal splitting
  • **Dendrogram** - the main analysis tool: a merge tree where height = distance; a horizontal cut line determines the number of clusters; a large height jump marks a natural boundary
  • **Linkage method** determines cluster shape: single - for complex shapes (chains), Ward - for compact shapes (like K-Means), complete/average - compromises
  • **Hierarchical vs K-Means:** the lineage tree of 200 butterfly species we discussed at the start is exactly the kind of task where hierarchical clustering is irreplaceable, providing multi-level structure instead of a flat partition; but for millions of points K-Means remains faster

Related Topics

Hierarchical clustering is one of the key unsupervised learning methods. Next we'll explore other approaches to clustering and dimensionality reduction:

  • K-Means Clustering — The basic clustering method we already know - hierarchical clustering solves its main problem (choosing K) but trades off in scalability
  • DBSCAN — Another clustering approach that doesn't require K - finds clusters of arbitrary shape via point density and automatically identifies outliers
  • PCA (Principal Component Analysis) — Dimensionality reduction before clustering - useful when there are many features and pairwise distances lose meaning (curse of dimensionality)
  • K-NN (K Nearest Neighbors) — Uses the same idea of distance between points, but for classification (supervised) rather than clustering (unsupervised)

Вопросы для размышления

  • If agglomerative and divisive build the same hierarchy but in opposite directions - why can the results differ? Hint: think about the greediness of each decision at each step.
  • A company with 50,000 customers wants multi-level segmentation. Hierarchical clustering doesn't scale to such volumes. How could K-Means and hierarchical clustering be combined to get both speed and hierarchy?
  • Single linkage perfectly identifies two crescents but "breaks" on noisy data due to chaining. Can the approach be modified to retain the ability to find complex shapes while reducing sensitivity to noise?

Связанные уроки

  • ml-16-clustering-kmeans — K-Means is the baseline clustering method this extends
  • ml-18-dbscan — Density-based clustering, another way to avoid fixing K
  • ml-19-pca — Reduce dimensions before computing pairwise distances
  • la-02-dot-product — Distance metrics build on vector dot products
  • ds-01 — Dendrogram is literally a tree data structure
  • alg-17-mst
Hierarchical Clustering

0

1

Sign In