Spectral Clustering

Summary
  • Spectral clustering uses the spectrum (eigenvalues/eigenvectors) of a similarity graph to embed data into a space where k-means can easily separate the clusters.
  • It handles non-linearly separable structures (e.g., concentric circles, moons) by capturing connectivity rather than raw Euclidean distance.
  • Key hyperparameters include the affinity definition (rbf, nearest-neighbour graph, cosine), the scale parameter (gamma or n_neighbors), and the label assignment strategy.
  • After computing the first $k$ eigenvectors of the symmetric normalised Laplacian, row-normalise them and run k-means to obtain cluster labels.

1. Overview

Given data points $x_i$, we create an affinity matrix $W$ whose entries reflect pairwise similarity. The (unnormalised) graph Laplacian is

$$ L = D - W, \qquad D = \mathrm{diag}(d_1, \dots, d_n), \quad d_i = \sum_j W_{ij}. $$
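As a quick sanity check, the unnormalised Laplacian can be built directly with NumPy. This is a minimal sketch; the small affinity matrix is purely illustrative:

```python
import numpy as np

# Toy symmetric affinity matrix (illustrative values)
W = np.array([
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.2],
    [0.1, 0.2, 0.0],
])

d = W.sum(axis=1)   # degrees d_i = sum_j W_ij
D = np.diag(d)      # degree matrix
L = D - W           # unnormalised graph Laplacian

print(np.allclose(L.sum(axis=1), 0))  # rows of L sum to zero → True
```

Note that $L$ is symmetric positive semi-definite and the all-ones vector is always an eigenvector with eigenvalue zero, which is why the *smallest* eigenvalues carry the cluster structure.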

Spectral clustering typically uses the symmetric normalised Laplacian

$$ L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}. $$

Take the $k$ eigenvectors associated with the smallest eigenvalues of $L_{\mathrm{sym}}$, stack them into a matrix $U$, normalise each row, and run k-means on those rows. The resulting assignments translate back to the original samples.
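The full pipeline can be sketched from scratch with NumPy and scikit-learn. This is a simplified illustration, not the library's internal implementation; the gamma value is an assumption chosen for this toy dataset:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_circles
from sklearn.metrics.pairwise import rbf_kernel

X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# Affinity matrix via an RBF kernel (gamma=50 is an assumed value for this data)
W = rbf_kernel(X, gamma=50)
np.fill_diagonal(W, 0)  # no self-loops

# Symmetric normalised Laplacian: L_sym = I - D^{-1/2} W D^{-1/2}
d = W.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

# Eigenvectors of the k smallest eigenvalues (eigh sorts them ascending)
k = 2
eigvals, eigvecs = np.linalg.eigh(L_sym)
U = eigvecs[:, :k]

# Row-normalise the spectral embedding, then cluster with k-means
U = U / np.linalg.norm(U, axis=1, keepdims=True)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```

In the embedded space the two circles collapse into two tight groups, so plain k-means separates them even though they are not linearly separable in the original coordinates.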

2. Python example

```python
import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib  # Japanese font support for matplotlib
from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

X, _ = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

model = SpectralClustering(
    n_clusters=2,
    affinity="rbf",
    gamma=50,
    assign_labels="kmeans",
    random_state=0,
)
labels = model.fit_predict(X)

plt.figure(figsize=(6, 5))
plt.scatter(X[:, 0], X[:, 1], c=labels, cmap="coolwarm", s=20)
plt.title("Spectral clustering result")
plt.tight_layout()
plt.show()
```

Figure: spectral clustering example (the two concentric circles are separated correctly).

3. Practical notes

  • For sparse graphs, use affinity="nearest_neighbors" and tune n_neighbors.
  • assign_labels="cluster_qr" can be more stable than the default k-means for high-dimensional embeddings.
  • Always standardise features before computing similarities; otherwise the gamma/distance scales become meaningless.

4. References

  • Shi, J., & Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE TPAMI.
  • Ng, A. Y., Jordan, M. I., & Weiss, Y. (2002). On Spectral Clustering: Analysis and an Algorithm. NeurIPS.
  • scikit-learn developers. (2024). Spectral clustering. https://scikit-learn.org/stable/modules/clustering.html#spectral-clustering