4.4 Distance
Summary
- Organise distance and similarity measures by typical use case.
- Compare representative distances for vectors, probability distributions, and optimal transport with code examples.
- Highlight the preprocessing and dimensionality-reduction considerations that influence distance behaviour.
Distance and similarity measures #
Distances quantify how far apart two items are; similarities do the opposite. They underpin clustering, recommendation, anomaly detection, and generative-model evaluation. Each distance assumes certain data properties, so matching the metric to the data (and the downstream algorithm) is essential.
Main categories #
1. Vector-space distances #
- Euclidean distance: intuitive straight-line distance; highly sensitive to feature scaling.
- Cosine similarity / distance: compares vector direction, ideal for TF-IDF or embedding spaces.
- Manhattan / Chebyshev distances: L1 and L∞ norms; useful for sparse vectors or robust comparisons.
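As a quick sketch of how these compare in practice (the vectors are made up for illustration), SciPy exposes each of them directly:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine, cityblock, chebyshev

# Illustrative feature vectors; b points the same way as a but is twice as long.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

d_l2 = euclidean(a, b)    # straight-line distance: sqrt(1 + 4 + 9)
d_cos = cosine(a, b)      # 1 - cosine similarity; ~0 for parallel vectors
d_l1 = cityblock(a, b)    # Manhattan: 1 + 2 + 3
d_linf = chebyshev(a, b)  # largest single-coordinate gap
print(d_l2, d_cos, d_l1, d_linf)
```

Note how the cosine distance is near zero even though the Euclidean distance is large: direction and magnitude answer different questions.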
2. Probability-distribution distances #
- Kullback–Leibler divergence (KL divergence): relative entropy; asymmetric and sensitive to zero probabilities.
- Jensen–Shannon divergence: symmetric, finite variant of KL; its square root yields a metric.
- Hellinger distance: the square-root transform provides symmetry and the triangle inequality.
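A minimal sketch with two made-up probability vectors shows the asymmetry of KL and the symmetric alternatives (SciPy's `jensenshannon` already returns the square root of the divergence, i.e. the metric):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Illustrative probability vectors over the same three outcomes.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])

kl_pq = entropy(p, q)     # KL(p || q): asymmetric, so it differs from KL(q || p)
kl_qp = entropy(q, p)
js = jensenshannon(p, q)  # sqrt of the JS divergence, hence a true metric
hellinger = np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q))
print(kl_pq, kl_qp, js, hellinger)
```

If either vector contains a zero where the other does not, `entropy` returns `inf`; Jensen–Shannon and Hellinger remain finite, which is one practical reason to prefer them.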
3. Optimal-transport based #
- Wasserstein distance: accounts for both location and shape differences; popular for generative models and drift detection.
- Sinkhorn distance: entropic-regularised optimal transport for faster computation.
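In one dimension SciPy computes the exact Wasserstein distance directly; the sketch below also includes a bare-bones Sinkhorn iteration (the sample points, weights, and regularisation strength are illustrative) to show how the entropic approximation lands close to the exact value:

```python
import numpy as np
from scipy.stats import wasserstein_distance

# 1-D Wasserstein between two empirical samples: y is x shifted by 0.5,
# so every unit of mass moves 0.5 and W1 = 0.5.
x = np.array([0.0, 1.0, 2.0])
y = np.array([0.5, 1.5, 2.5])
w1 = wasserstein_distance(x, y)

def sinkhorn_cost(a, b, cost, reg, n_iter=500):
    """Entropic-regularised OT cost via plain Sinkhorn iterations (a sketch)."""
    K = np.exp(-cost / reg)      # Gibbs kernel from the cost matrix
    u = np.ones_like(a)
    for _ in range(n_iter):      # alternate scaling to match both marginals
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return np.sum(plan * cost)   # transport cost of the regularised plan

# Uniform weights on the same support points; pairwise absolute-value costs.
a = np.full(3, 1 / 3)
b = np.full(3, 1 / 3)
cost = np.abs(x[:, None] - y[None, :])
approx = sinkhorn_cost(a, b, cost, reg=0.1)
print(w1, approx)
```

For production use, the POT (Python Optimal Transport) library provides stabilised Sinkhorn solvers; the loop above is only meant to expose the idea.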
Comparing vector distances #
The notion of “nearest neighbour” changes with the distance: cosine focuses on direction, so vectors pointing the same way are close regardless of length, while Euclidean emphasises magnitude differences.
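A small sketch makes the disagreement concrete; the labelled vectors are hypothetical, chosen so that one candidate shares the query's direction while others share its magnitude:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Hypothetical candidate vectors; labels are illustrative, not from a dataset.
# "E" points in the same direction as the query but is much longer.
vectors = {
    "A": np.array([1.0, 0.0]),
    "C": np.array([0.0, 1.0]),
    "D": np.array([2.0, 0.5]),
    "E": np.array([3.0, 3.0]),
}
query = np.array([1.0, 1.0])

nn_cosine = min(vectors, key=lambda k: cosine(query, vectors[k]))
nn_euclid = min(vectors, key=lambda k: euclidean(query, vectors[k]))
print(nn_cosine, nn_euclid)  # cosine picks E; Euclidean picks a shorter vector
```

The two metrics return different nearest neighbours for the same query, which is exactly why the choice must be made deliberately.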
Choosing a distance #
- Check feature scaling: standardise or normalise when magnitudes differ across features.
- Consider sparsity: sparse text or recommender data often works better with cosine distance.
- Identify distributional needs: use KL/Jensen–Shannon/Hellinger for probability vectors; mind support mismatches.
- Assess shape versus location: Wasserstein captures both shifts and spread, valuable for generative models and drift.
- Balance accuracy and cost: high-dimensional or large datasets may require approximations (LSH, Sinkhorn).
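The feature-scaling point is easy to demonstrate. In this sketch (the incomes, ages, and scaling bounds are all made up for illustration), the raw Euclidean distance is driven almost entirely by the dollar-scale feature, while after min-max scaling the age gap dominates instead:

```python
import numpy as np

# Illustrative rows: [annual income in dollars, age in years].
u = np.array([50_000.0, 25.0])
v = np.array([51_000.0, 60.0])

raw = np.linalg.norm(u - v)
income_share = (u[0] - v[0]) ** 2 / raw ** 2  # fraction of squared distance

# Min-max scale with assumed bounds so both features land in [0, 1].
lo = np.array([20_000.0, 18.0])
hi = np.array([200_000.0, 90.0])
u_s = (u - lo) / (hi - lo)
v_s = (v - lo) / (hi - lo)
scaled = np.linalg.norm(u_s - v_s)
age_share = (u_s[1] - v_s[1]) ** 2 / scaled ** 2
print(income_share, age_share)  # income dominates raw; age dominates scaled
```

A 35-year age gap barely registers next to a $1,000 income gap until the features share a common scale.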
Quick reference #
| Category | Measure | Typical use | Notes |
|---|---|---|---|
| Vector | Cosine similarity | Text embeddings, TF-IDF, sparse vectors | Handle zero vectors carefully |
| Vector | Euclidean / L1 / L∞ | Clustering on continuous features | Feature scaling is critical |
| Distribution | KL divergence | Compare model vs. data distributions | Asymmetric; undefined with zero support |
| Distribution | Jensen–Shannon | Symmetric comparison of probability vectors | Square root becomes a metric |
| Distribution | Hellinger distance | Bayesian updates, drift monitoring | Bin/normalisation choices matter |
| Optimal transport | Wasserstein distance | Generative model evaluation, anomaly detection | Computationally heavy; consider Sinkhorn |
Checklist #
- Clarified whether the inputs are vectors or distributions
- Verified the assumptions (symmetry, triangle inequality) required by the downstream algorithm
- Evaluated the impact of normalisation or dimensionality reduction
- Considered approximate methods when exact distance is too expensive
- Visualised distance changes to ensure they align with intuition