HDBSCAN

Summary
  • HDBSCAN generalises DBSCAN by exploring a hierarchy of density thresholds, allowing clusters with different densities and automatically labelling low-density samples as noise.
  • Two knobs control the behaviour: min_cluster_size (smallest meaningful cluster) and min_samples (how strict the core definition is). Samples with label -1 are treated as noise.
  • The algorithm builds a spanning tree using mutual reachability distances, condenses it, and ranks clusters by persistence to decide which survive.
  • HDBSCAN integrates nicely with UMAP or other manifold learners: run UMAP for dimensionality reduction, then cluster the embeddings with HDBSCAN to detect organic groups.

1. Overview #

Classic DBSCAN relies on a single eps radius; picking a value that suits every region is tricky when densities differ. HDBSCAN removes eps and instead sweeps all radius values, constructing a hierarchy of clusters. From that hierarchy it extracts the most persistent clusters—those that remain intact across a wide range of density levels.

Key parameters:

  • min_cluster_size: minimum number of points that makes a cluster worth keeping.
  • min_samples: how many neighbours are needed for a point to be core (defaults to min_cluster_size). Larger values make the algorithm more conservative.
  • cluster_selection_method: "eom" (excess of mass) prefers stable clusters, whereas "leaf" keeps the finest-grained leaves; the sketch below compares the two.
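
For intuition, here is a minimal sketch comparing the two selection methods on the same data (the blob dataset and all parameter values are illustrative choices, not recommendations):

import hdbscan
from sklearn.datasets import make_blobs

# Four blobs, two dense and two diffuse, to exercise variable density.
X, _ = make_blobs(n_samples=500, centers=4,
                  cluster_std=[0.4, 0.4, 1.2, 1.2], random_state=0)

for method in ("eom", "leaf"):
    labels = hdbscan.HDBSCAN(min_cluster_size=15,
                             cluster_selection_method=method).fit_predict(X)
    # Cluster labels run 0..k-1; -1 marks noise.
    print(f"{method}: {labels.max() + 1} clusters, {(labels == -1).sum()} noise points")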

2. Core distance and mutual reachability #

For each point $x$ we compute the core distance

$$ d_{\mathrm{core}}(x) = \text{distance to the } \texttt{min\_samples}\text{-th nearest neighbour}. $$

The mutual reachability distance between $x$ and $y$ is

$$ d_{\mathrm{mreach}}(x, y) = \max\left\{ d_{\mathrm{core}}(x),\; d_{\mathrm{core}}(y),\; \lVert x - y \rVert \right\}. $$

HDBSCAN builds a minimum spanning tree over these distances, condenses the hierarchy, and assigns each cluster a persistence score (integrated stability). Clusters with high persistence are reported; samples that never belong to a persistent component are labelled noise ($-1$).
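
The two definitions translate directly into NumPy. Below is a brute-force sketch (random illustrative data; min_samples=3 is arbitrary, and it excludes each point from its own neighbour list, a convention detail that implementations may handle differently):

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
min_samples = 3

# Core distance: distance to the min_samples-th nearest neighbour.
# kneighbors includes each query point itself at distance 0, hence the +1.
nn = NearestNeighbors(n_neighbors=min_samples + 1).fit(X)
dist, _ = nn.kneighbors(X)
core = dist[:, -1]

# Mutual reachability: elementwise max of both core distances and the raw distance.
pairwise = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
mreach = np.maximum(pairwise, np.maximum(core[:, None], core[None, :]))
print(mreach.shape)  # (50, 50)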

3. Python example #

Use the hdbscan library (install via pip install hdbscan) to cluster a two-moons dataset:

import numpy as np
import matplotlib.pyplot as plt
import hdbscan
from sklearn.datasets import make_moons

# Two interleaving half-moons with mild Gaussian noise.
X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=20,             # smallest group worth reporting
    min_samples=10,                  # stricter core-point definition
    cluster_selection_method="eom",  # prefer the most stable clusters
)
labels = clusterer.fit_predict(X)    # noise points receive label -1

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=20)
plt.title("HDBSCAN clustering result")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.grid(alpha=0.2)
plt.show()

print("Cluster persistence:", clusterer.cluster_persistence_)
print("Noise points:", np.sum(labels == -1))

cluster_persistence_ indicates which clusters are the most stable; if persistence is low, consider increasing min_cluster_size or reducing noise in the data.
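
The fitted estimator also exposes per-sample membership strengths in probabilities_ (0 for noise, approaching 1 for core members). Continuing from the example above, a short sketch (the 0.9 cutoff is an arbitrary illustration) keeps only confident assignments:

# Membership strength per sample; noise points always get 0.
strong = labels[clusterer.probabilities_ > 0.9]
print("High-confidence members per cluster:", np.bincount(strong[strong >= 0]))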

4. Practical tips #

  • Standardise features before clustering—distance-based methods are sensitive to scale.
  • If you want more clusters, decrease min_cluster_size or switch to cluster_selection_method="leaf".
  • To remove tiny noisy fragments, leave min_cluster_size high and keep min_samples modest (e.g., 5–10).
  • Combine with UMAP: fit UMAP to project high-dimensional data to 2–10 dimensions, then run HDBSCAN on the embedding for robust unsupervised pipelines; a sketch follows this list.
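
A minimal sketch of that pipeline, assuming the umap-learn package is installed (pip install umap-learn); the digits dataset and all parameter values here are stand-ins, not recommendations:

import hdbscan
import umap
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Project the 64-dimensional digits to 5 dimensions before clustering.
embedding = umap.UMAP(n_components=5, n_neighbors=30,
                      random_state=42).fit_transform(X)

labels = hdbscan.HDBSCAN(min_cluster_size=30).fit_predict(embedding)
print("Clusters found:", labels.max() + 1)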

5. References #

  • Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD.
  • McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical Density Based Clustering. Journal of Open Source Software.
  • scikit-learn developers. (2024). Clustering. https://scikit-learn.org/stable/modules/clustering.html