Summary
- Hierarchical clustering creates a tree (dendrogram) that shows how samples merge, so you can explore cluster structures at multiple resolutions.
- In the agglomerative setting every point starts as its own cluster and pairs are merged step by step according to a linkage criterion.
- Linkage options (`ward`, `complete`, `average`, `single`) and distance metrics (`euclidean`, `cosine`, etc.) change the shape of the dendrogram.
- You can visualise the hierarchy with `scipy.cluster.hierarchy` and extract flat clusters with scikit-learn's `AgglomerativeClustering`.
1. Overview
Agglomerative clustering proceeds as follows:
- Treat each sample as its own cluster.
- Find the pair of clusters with the smallest distance under the chosen linkage rule.
- Merge the pair, record the merge height, and repeat until everything becomes one cluster.
Cutting the dendrogram at a particular height yields a flat clustering. Ward’s linkage tends to produce compact clusters, while single linkage often forms chain-like clusters.
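To make the three steps concrete, here is a minimal from-scratch sketch of the loop, assuming single linkage and Euclidean distance for simplicity; the function name and the brute-force pairwise scan are illustrative only, not how production libraries implement it:

```python
import numpy as np

def agglomerative_single_linkage(X, n_clusters):
    # Step 1: every sample starts as its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > n_clusters:
        best_pair, best_dist = None, np.inf
        # Step 2: scan all cluster pairs for the smallest
        # single-linkage (closest-pair) distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best_dist:
                    best_pair, best_dist = (a, b), d
        a, b = best_pair
        # Step 3: merge the closest pair; deleting index b (> a)
        # keeps index a valid.
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10, 2))
print(agglomerative_single_linkage(X_demo, n_clusters=3))
```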
2. Linkage criteria
Let $C_i$ and $C_j$ be clusters with centroids $\mu_i$ and $\mu_j$, and let $|C_i|$ denote the number of points in $C_i$. Common linkage rules:
- Ward: minimises the increase in within-cluster variance $$ \Delta = \frac{|C_i||C_j|}{|C_i| + |C_j|} \lVert \mu_i - \mu_j \rVert^2. $$
- Complete: distance between the farthest pair of points across clusters.
- Average: average pairwise distance between points in the two clusters.
- Single: distance between the closest pair of points (prone to chaining).
Pick the rule that matches the shapes you expect and scale your features beforehand.
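To see how the rules diverge in practice, the small sketch below scores the distance between two hand-made clusters under each criterion; the point values are arbitrary examples chosen for illustration:

```python
import numpy as np

A = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])

# All pairwise Euclidean distances between points of A and points of B.
pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

single = pair.min()    # closest pair across the two clusters
complete = pair.max()  # farthest pair across the two clusters
average = pair.mean()  # mean over all cross-cluster pairs

# Ward: increase in within-cluster variance caused by merging A and B,
# matching the formula above.
mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
ward = (len(A) * len(B)) / (len(A) + len(B)) * np.sum((mu_a - mu_b) ** 2)

print(f"single={single:.2f} complete={complete:.2f} "
      f"average={average:.2f} ward={ward:.2f}")
```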
3. Python example
The snippet below builds a dendrogram with SciPy and then performs agglomerative clustering with scikit-learn:
```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

# Generate three well-separated Gaussian blobs.
X, _ = make_blobs(n_samples=50, centers=3, cluster_std=0.7, random_state=42)

# Compute the merge tree with Ward linkage.
Z = linkage(X, method="ward")

plt.figure(figsize=(8, 4))
dendrogram(Z, truncate_mode="level", p=5)  # show only the top 5 merge levels
plt.title("Hierarchical clustering with Ward linkage")
plt.xlabel("sample index or cluster ID")
plt.ylabel("distance")
plt.show()
```
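SciPy can also extract flat clusters directly from `Z` without going through scikit-learn: a minimal sketch using `fcluster`, where the cut height 10.0 is just an illustrative value for this dataset:

```python
from scipy.cluster.hierarchy import fcluster

# Cut the tree at height 10.0; criterion="distance" returns flat labels.
flat = fcluster(Z, t=10.0, criterion="distance")
print("number of flat clusters:", np.unique(flat).size)
```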
And the scikit-learn side of the example:

```python
from sklearn.cluster import AgglomerativeClustering

# Average linkage over Euclidean distances. Note: the `affinity`
# parameter was renamed `metric` in scikit-learn 1.2 and removed in 1.4.
clustering = AgglomerativeClustering(
    n_clusters=3,
    linkage="average",
    metric="euclidean",
)
labels = clustering.fit_predict(X)
```
Set `distance_threshold` (with `n_clusters=None`) instead of `n_clusters` if you want to cut the dendrogram at a custom height.
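A minimal sketch of that variant; the threshold value 10.0 is illustrative and dataset-dependent:

```python
from sklearn.cluster import AgglomerativeClustering

# n_clusters must be None when distance_threshold is given; merges whose
# height exceeds the threshold are not performed.
cut = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=10.0,
    linkage="ward",
)
labels = cut.fit_predict(X)
print("clusters found:", labels.max() + 1)
```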
4. References
- Ward, J. H. (1963). Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association.
- Müllner, D. (2011). Modern Hierarchical, Agglomerative Clustering Algorithms. arXiv:1109.2378.
- scikit-learn developers. (2024). Hierarchical clustering. https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering