HDBSCAN

Last updated 2025-11-05
Summary
  • HDBSCAN generalises DBSCAN by exploring a hierarchy of density thresholds, allowing clusters with different densities and automatically labelling low-density samples as noise.
  • Two knobs control the behaviour: min_cluster_size (smallest meaningful cluster) and min_samples (how strict the core definition is). Samples with label -1 are treated as noise.
  • The algorithm builds a spanning tree using mutual reachability distances, condenses it, and ranks clusters by persistence to decide which survive.
  • HDBSCAN integrates nicely with UMAP or other manifold learners: run UMAP for dimensionality reduction, then cluster the embeddings with HDBSCAN to detect organic groups.

Intuition #

Picture sweeping a density threshold from high to low over the data: dense regions emerge first as separate islands and gradually merge as the threshold drops. DBSCAN takes a single snapshot of this process (one eps); HDBSCAN records the whole hierarchy and keeps the clusters that persist across the widest range of thresholds, labelling points that never join a stable cluster as noise.

Detailed Explanation #

1. Overview #

Classic DBSCAN relies on a single eps radius; picking a value that suits every region is tricky when densities differ. HDBSCAN removes eps and instead sweeps all radius values, constructing a hierarchy of clusters. From that hierarchy it extracts the most persistent clusters—those that remain intact across a wide range of density levels.

Key parameters:

  • min_cluster_size: minimum number of points that makes a cluster worth keeping.
  • min_samples: how many neighbours are needed for a point to be core (defaults to min_cluster_size). Larger values make the algorithm more conservative.
  • cluster_selection_method: "eom" (excess of mass) prefers stable clusters, whereas "leaf" keeps the finest-grained leaves.

2. Core distance and mutual reachability #

For each point $x$ we compute the core distance

$$ d_{\mathrm{core}}(x) = \text{distance to the } \texttt{min\_samples}\text{-th nearest neighbour}. $$

The mutual reachability distance between $x$ and $y$ is

$$ d_{\mathrm{mreach}}(x, y) = \max\left\{ d_{\mathrm{core}}(x),\; d_{\mathrm{core}}(y),\; \lVert x - y \rVert \right\}. $$

HDBSCAN builds a minimum spanning tree over these distances, condenses the hierarchy, and assigns each cluster a persistence score (integrated stability). Clusters with high persistence are reported; samples that never belong to a persistent component are labelled noise ($-1$).
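The two definitions above can be sketched in plain NumPy (a toy illustration on a dense distance matrix; real implementations use spatial indexes and never materialise all pairwise distances):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
min_samples = 5

# Pairwise Euclidean distances.
diff = X[:, None, :] - X[None, :, :]
dist = np.sqrt((diff ** 2).sum(-1))

# Core distance: distance to the min_samples-th nearest neighbour.
# Each sorted row starts with the point itself at distance 0,
# so index min_samples is the min_samples-th neighbour.
core = np.sort(dist, axis=1)[:, min_samples]

# Mutual reachability: max of the two core distances and the raw distance.
mreach = np.maximum(dist, np.maximum(core[:, None], core[None, :]))
```

Note that `mreach` is symmetric and never smaller than the raw distance; this is the metric over which the minimum spanning tree is built.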

3. Python example #

Use the hdbscan library (install via pip install hdbscan) to cluster a two-moons dataset:

import numpy as np
import matplotlib.pyplot as plt
import hdbscan
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.08, random_state=42)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=20,
    min_samples=10,
    cluster_selection_method="eom",
)
labels = clusterer.fit_predict(X)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="tab10", s=20)
plt.title("HDBSCAN clustering result")
plt.xlabel("feature 1")
plt.ylabel("feature 2")
plt.grid(alpha=0.2)
plt.show()

print("Cluster persistence:", clusterer.cluster_persistence_)
print("Noise points:", np.sum(labels == -1))

cluster_persistence_ indicates which clusters are the most stable; if persistence is low, consider increasing min_cluster_size or reducing noise in the data.

4. Practical tips #

  • Standardise features before clustering—distance-based methods are sensitive to scale.
  • If you want more clusters, decrease min_cluster_size or switch cluster_selection_method="leaf".
  • To remove tiny noisy fragments, leave min_cluster_size high and keep min_samples modest (e.g., 5–10).
  • Combine with UMAP: fit UMAP to project high-dimensional data to 2–10 dimensions, then run HDBSCAN on the embedding for robust unsupervised pipelines.

FAQ #

What is HDBSCAN? #

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that extends DBSCAN by exploring a full hierarchy of density levels. Instead of requiring a fixed radius eps, HDBSCAN sweeps all radii and extracts the most persistent clusters — those that remain stable across a wide range of densities. This makes it more robust to varying cluster densities and easier to tune.

What is the difference between HDBSCAN and DBSCAN? #

|                 | DBSCAN                  | HDBSCAN                                |
|-----------------|-------------------------|----------------------------------------|
| Key parameter   | eps + min_samples       | min_cluster_size + min_samples         |
| Varying density | Struggles               | Handles naturally                      |
| Stability       | Sensitive to eps choice | More robust across datasets            |
| Output          | Hard cluster labels     | Labels + soft membership probabilities |
| Complexity      | O(n log n)              | O(n log n)                             |

HDBSCAN is generally the better default for real-world data; DBSCAN is simpler and sufficient when density is uniform.

How do I choose min_cluster_size? #

min_cluster_size is the most important parameter — it defines the smallest cluster worth keeping. A cluster must contain at least this many points to survive the condensation step. Practical guidance:

  • Start with min_cluster_size = max(5, n_samples // 100) as a rough baseline.
  • Larger values produce fewer, denser clusters.
  • Smaller values allow tiny clusters but may increase noise or over-fragment.
  • Inspect cluster_persistence_ to assess stability; low persistence suggests the cluster is unreliable.

What does the noise label (−1) mean in HDBSCAN? #

Points labelled −1 are noise — they did not belong to any sufficiently persistent cluster. This is by design: HDBSCAN treats low-density regions as noise rather than forcing every point into a cluster (unlike k-means). A high noise fraction may indicate:

  • min_cluster_size is too large.
  • The data genuinely has many outliers.
  • Features need better preprocessing or dimensionality reduction (e.g., UMAP before HDBSCAN).

Can I use HDBSCAN with scikit-learn? #

Yes. From scikit-learn 1.3, sklearn.cluster.HDBSCAN is available:

from sklearn.cluster import HDBSCAN

hdb = HDBSCAN(min_cluster_size=20, min_samples=10)
labels = hdb.fit_predict(X)

The hdbscan package (pip install hdbscan) is an older alternative with additional features such as soft clustering (clusterer.probabilities_) and condensed tree visualisations.

5. References #

  • Campello, R. J. G. B., Moulavi, D., Zimek, A., & Sander, J. (2013). Density-Based Clustering Based on Hierarchical Density Estimates. PAKDD.
  • McInnes, L., Healy, J., & Astels, S. (2017). hdbscan: Hierarchical Density Based Clustering. Journal of Open Source Software.
  • scikit-learn developers. (2024). Clustering. https://scikit-learn.org/stable/modules/clustering.html