Chapter 5 #
Clustering #
Group similar observations to discover structure, summarize datasets, or power downstream tasks. Choose algorithms based on shape assumptions, robustness needs, and scalability.
Algorithms at a glance #
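- k‑means / k‑means++ / X‑means: fast; assumes roughly spherical, similarly sized clusters; sensitive to feature scale and initialization (k‑means++ seeding reduces the latter; X‑means also estimates k).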
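- DBSCAN / HDBSCAN: density‑based; finds arbitrarily shaped clusters and flags outliers as noise. DBSCAN needs sensible eps/minPts; HDBSCAN replaces eps with a minimum cluster size and handles varying density better.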
- Gaussian Mixture (GMM): probabilistic clusters; soft assignments and ellipsoids.
- Hierarchical clustering: dendrograms expose multi‑scale structure; the linkage choice (single, complete, average, Ward) strongly shapes the result.
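
To make the density-based entry concrete, here is a minimal DBSCAN sketch in plain Python. It is an illustration under stated assumptions, not a reference implementation: the function name `dbscan`, the brute-force neighbour search, and the toy points are all made up for this example, and a point counts as its own neighbour when comparing against `min_pts`.

```python
from math import dist  # Python 3.8+

NOISE = -1  # label used here for outliers (an assumption of this sketch)


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE marks outliers."""
    n = len(points)
    # Brute-force eps-neighbourhoods, O(n^2); each point includes itself.
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        if len(neighbors[i]) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is core: keep expanding
    return labels


# Toy data: two tight squares and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

On this toy data the two squares become two clusters and the isolated point is labelled as noise. Shrinking `eps` below 1 would turn every point into noise, which is the parameter sensitivity the list mentions.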
Practice tips #
- Standardize features before running distance‑based methods; use PCA to denoise and PCA/UMAP for visualization.
- Pick k via silhouette, elbow, or stability across resamples.
- Validate against held‑out labels when available (external indices such as ARI or NMI); otherwise, report internal indices plus qualitative inspection.
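
The first two tips can be sketched end to end in plain Python: standardize the features, cluster for several values of k, and keep the k with the highest mean silhouette. Everything here is an assumption made for illustration (the helper names `standardize`, `kmeans`, `silhouette`, the toy data with features on very different scales, the fixed seed, and the five restarts per k); a real project would more likely rely on a library implementation.

```python
import random
from math import dist, sqrt  # dist requires Python 3.8+


def standardize(rows):
    """Rescale each column to zero mean and unit (population) variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [tuple((v - m) / s for v, m, s in zip(r, means, stds)) for r in rows]


def kmeans(points, k, rng, iters=50):
    """Lloyd's algorithm with k-means++ seeding; returns a label per point."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Sample the next seed proportionally to squared distance.
        d2 = [min(dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=d2)[0])
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new == labels:
            break  # assignments stable: converged
        labels = new
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # leave an empty cluster's center where it is
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for one cluster; treated as 0 in this sketch
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)  # cohesion
        b = min(sum(dist(p, q) for q in other) / len(other)  # separation
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data: three groups; the second feature is ~1000x larger than the first.
raw = [(cx + dx, cy + dy)
       for cx, cy in [(0, 0), (10, 0), (5, 9000)]
       for dx in (-0.5, 0.5) for dy in (-200, 200)]
scaled = standardize(raw)

rng = random.Random(0)
best_k, best_score = None, -1.0
for k in range(2, 6):
    # Several restarts per k guard against a poor initialization.
    score = max(silhouette(scaled, kmeans(scaled, k, rng)) for _ in range(5))
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))
```

Without the `standardize` step the second feature would dominate every distance, so the two groups that differ only in the first feature would be hard to separate. The silhouette comparison is one option among those listed; the elbow heuristic or stability across resamples could be swapped into the same loop.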