k-means++

2.5.2

Last updated 2020-02-12
Summary
  • k-means++ spreads the initial centroids apart, reducing the chance that vanilla k-means converges to a poor local optimum.
  • After a first centroid is drawn uniformly at random, each additional centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen, discouraging tight clusters of seeds.
  • In scikit-learn, KMeans(init="k-means++") activates the method, making it easy to compare with purely random initialisation.
  • Large-scale variants such as mini-batch k-means can use the same k-means++ seeding and are common in streaming or big-data settings.
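
The scikit-learn comparison mentioned above could look roughly like the sketch below. The synthetic dataset from make_blobs, the cluster count, and the number of seeds are assumptions chosen for illustration, and n_init=1 is set so that each run reflects a single initialisation rather than the best of several restarts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for this sketch): 5 well-separated blobs.
X, _ = make_blobs(n_samples=600, centers=5, cluster_std=0.6, random_state=0)

results = {}
for init in ("k-means++", "random"):
    inertias = []
    for seed in range(10):
        # n_init=1 so each fit uses exactly one initialisation.
        km = KMeans(n_clusters=5, init=init, n_init=1, random_state=seed).fit(X)
        inertias.append(km.inertia_)  # within-cluster sum of squared distances
    results[init] = inertias
    print(f"{init:>10}: mean inertia {np.mean(inertias):.1f}, "
          f"worst {np.max(inertias):.1f}")
```

Lower inertia means a tighter clustering. On data like this, random initialisation would be expected to show a worse worst case more often, but the exact numbers depend on the data and the seeds.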

Intuition

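Vanilla k-means is sensitive to where its centroids start: if two seeds land in the same true cluster, the algorithm can converge to a poor local optimum. k-means++ reduces that risk by choosing each new seed with probability proportional to its squared distance from the nearest seed already chosen, so the seeds tend to spread across the data before the usual k-means iterations begin.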
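
The seeding rule described in the summary can be sketched in plain Python as follows. This is a minimal illustration, not scikit-learn's implementation; the function names kmeanspp_seeds and sq_dist and the toy data are hypothetical and exist only for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeanspp_seeds(points, k, rng=None):
    """Pick k initial centroids from `points` with k-means++ style seeding."""
    rng = rng or random.Random()
    # The first seed is drawn uniformly at random.
    seeds = [rng.choice(points)]
    # d2[i] holds the squared distance from points[i] to its nearest seed.
    d2 = [sq_dist(p, seeds[0]) for p in points]
    while len(seeds) < k:
        if sum(d2) > 0:
            # Sample an index with probability proportional to d2.
            idx = rng.choices(range(len(points)), weights=d2, k=1)[0]
        else:
            # Every point coincides with a seed; fall back to a uniform draw.
            idx = rng.randrange(len(points))
        seeds.append(points[idx])
        # Update nearest-seed distances against the newly added seed.
        d2 = [min(d, sq_dist(p, points[idx])) for d, p in zip(d2, points)]
    return seeds

# Toy data (an assumption): three tight groups of 20 points, far apart.
rng = random.Random(0)
points = [(cx + rng.gauss(0, 0.1), cy + rng.gauss(0, 0.1))
          for cx, cy in [(0, 0), (10, 0), (0, 10)] for _ in range(20)]
seeds = kmeanspp_seeds(points, k=3, rng=rng)
print(seeds)
```

Because a point that is already a seed has weight zero, it cannot be drawn again while any other point has positive weight. Points far from every seed carry most of the probability mass, which is why the seeds tend to land in different groups.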

Detailed Explanation