- HDBSCAN generalises DBSCAN by exploring a hierarchy of density thresholds, allowing clusters with different densities and automatically labelling low-density samples as noise.
- Two knobs control the behaviour: `min_cluster_size` (smallest meaningful cluster) and `min_samples` (how strict the core definition is). Samples with label `-1` are treated as noise.
- The algorithm builds a minimum spanning tree using mutual reachability distances, condenses it, and ranks clusters by persistence to decide which survive.
- HDBSCAN integrates nicely with UMAP or other manifold learners: run UMAP for dimensionality reduction, then cluster the embeddings with HDBSCAN to detect organic groups.
1. Overview #
Classic DBSCAN relies on a single `eps` radius; picking a value that suits every region is tricky when densities differ. HDBSCAN removes `eps` and instead sweeps all radius values, constructing a hierarchy of clusters. From that hierarchy it extracts the most persistent clusters—those that remain intact across a wide range of density levels.
Key parameters:
- `min_cluster_size`: minimum number of points that makes a cluster worth keeping.
- `min_samples`: how many neighbours are needed for a point to be core (defaults to `min_cluster_size`). Larger values make the algorithm more conservative.
- `cluster_selection_method`: `"eom"` (excess of mass) prefers stable clusters, whereas `"leaf"` keeps the finest-grained leaves (the two are compared in the sketch below).
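To see how these knobs interact, the short sketch below (using the same `hdbscan` library as in the example of section 3) fits two clusterers that differ only in `cluster_selection_method` and compares how many clusters each reports; the blob parameters are arbitrary illustration values:

```python
import numpy as np
import hdbscan
from sklearn.datasets import make_blobs

# Three blobs with different spreads, so densities vary across the data.
X, _ = make_blobs(n_samples=600, centers=3,
                  cluster_std=[0.4, 1.0, 2.0], random_state=0)

for method in ("eom", "leaf"):
    clusterer = hdbscan.HDBSCAN(min_cluster_size=15,
                                cluster_selection_method=method)
    labels = clusterer.fit_predict(X)
    n_clusters = labels.max() + 1  # labels are 0..k-1, noise is -1
    print(f"{method}: {n_clusters} clusters, "
          f"{np.sum(labels == -1)} noise points")
```

On data like this, `"leaf"` typically reports more, finer-grained clusters than `"eom"`.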
2. Core distance and mutual reachability #
For each point $x$ we compute the core distance
$$ d_{\mathrm{core}}(x) = \text{distance to the } \texttt{min\_samples}\text{-th nearest neighbour}. $$
The mutual reachability distance between $x$ and $y$ is
$$ d_{\mathrm{mreach}}(x, y) = \max\left\{ d_{\mathrm{core}}(x),\; d_{\mathrm{core}}(y),\; \lVert x - y \rVert \right\}. $$
HDBSCAN builds a minimum spanning tree over these distances, condenses the hierarchy, and assigns each cluster a persistence score (integrated stability). Clusters with high persistence are reported; samples that never belong to a persistent component are labelled noise ($-1$).
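To make these definitions concrete, here is a minimal NumPy/scikit-learn sketch (an illustration, not the library's internal implementation) that computes core distances with `NearestNeighbors` and assembles the mutual reachability matrix for a toy dataset:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2))  # tiny toy dataset
min_samples = 3

# Core distance: distance to the min_samples-th nearest neighbour.
# kneighbors on the training data returns the point itself first, so we
# ask for one extra neighbour; exact off-by-one conventions vary.
nn = NearestNeighbors(n_neighbors=min_samples + 1).fit(X)
dists, _ = nn.kneighbors(X)
core = dists[:, -1]

# Pairwise Euclidean distances.
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Mutual reachability: max of the two core distances and the raw distance.
mreach = np.maximum(pairwise, np.maximum(core[:, None], core[None, :]))
print(mreach.round(2))
```

Running a minimum-spanning-tree algorithm over `mreach` would reproduce the first step of the hierarchy described above.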
3. Python example #
Use the `hdbscan` library (install via `pip install hdbscan`) to cluster a two-moons dataset:
```python
import numpy as np
import matplotlib.pyplot as plt
import hdbscan
from sklearn.datasets import make_moons

# Two interleaved half-moons with mild Gaussian noise.
X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=20,
    min_samples=10,
    cluster_selection_method="eom",
)
labels = clusterer.fit_predict(X)  # noise points get the label -1

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=20)
plt.title("HDBSCAN clustering result")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.grid(alpha=0.2)
plt.show()

print("Cluster persistence:", clusterer.cluster_persistence_)
print("Noise points:", np.sum(labels == -1))
```
`cluster_persistence_` indicates which clusters are the most stable; if persistence is low, consider increasing `min_cluster_size` or reducing noise in the data.
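Beyond the hard labels, the fitted clusterer exposes per-point diagnostics; the snippet below continues the example above and uses `probabilities_` (membership strength) and `outlier_scores_` (GLOSH outlier scores) to flag weakly assigned points. The 0.5 cut-off is an arbitrary illustration, not a recommended default.

```python
# Continuing from the fitted `clusterer` and `labels` above.
# probabilities_ holds each point's membership strength in [0, 1];
# fringe members of a cluster score low even when not labelled noise.
fringe = (labels != -1) & (clusterer.probabilities_ < 0.5)  # illustrative cut-off
print("Weakly assigned points:", np.sum(fringe))

# outlier_scores_ ranks points by how outlier-like they are (GLOSH).
print("Most outlier-like point:", np.argmax(clusterer.outlier_scores_))
```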
4. Practical tips #
- Standardise features before clustering—distance-based methods are sensitive to scale.
- If you want more clusters, decrease `min_cluster_size` or switch to `cluster_selection_method="leaf"`.
- To remove tiny noisy fragments, leave `min_cluster_size` high and keep `min_samples` modest (e.g., 5–10).
- Combine with UMAP: fit UMAP to project high-dimensional data to 2–10 dimensions, then run HDBSCAN on the embedding for robust unsupervised pipelines (see the sketch after this list).
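As a concrete version of the last tip, here is a sketch of a UMAP-then-HDBSCAN pipeline; it assumes the `umap-learn` package (`pip install umap-learn`) and uses the digits dataset merely as a stand-in for high-dimensional data, with illustrative parameter values throughout:

```python
import hdbscan
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)

# Standardise first: distance-based methods are sensitive to scale.
X_scaled = StandardScaler().fit_transform(X)

# Project the 64-dimensional digits down to 5 dimensions with UMAP.
embedding = umap.UMAP(n_components=5, n_neighbors=15,
                      random_state=42).fit_transform(X_scaled)

# Cluster the embedding; HDBSCAN finds the groups and flags noise as -1.
labels = hdbscan.HDBSCAN(min_cluster_size=30).fit_predict(embedding)
print("Clusters found:", labels.max() + 1)
print("Noise points:", (labels == -1).sum())
```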
5. References #
- Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD.
- McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical Density Based Clustering. Journal of Open Source Software.
- scikit-learn developers. (2024). Clustering. https://scikit-learn.org/stable/modules/clustering.html