Summary
- Organise distance and similarity measures by typical use case.
- Compare representative distances for vectors, probability distributions, and optimal transport with code examples.
- Highlight the preprocessing and dimensionality-reduction considerations that influence distance behaviour.
Chapter 4 #
Distance and similarity measures #
Distances quantify how far apart two items are; similarities quantify how alike they are. They underpin clustering, recommendation, anomaly detection, and generative-model evaluation. Each distance assumes certain data properties, so matching the metric to the data (and to the downstream algorithm) is essential.
Main categories #
1. Vector-space distances #
- Euclidean distance: intuitive straight-line distance; highly sensitive to feature scaling.
- Cosine similarity / distance: compares vector direction only, ideal for TF-IDF or embedding spaces.
- Manhattan / Chebyshev distances: the L1 and L∞ norms; L1 suits sparse or outlier-prone data, while L∞ tracks the worst-case coordinate difference.
2. Probability-distribution distances #
- Kullback–Leibler divergence (KL divergence): relative entropy; asymmetric and sensitive to zero probabilities.
- Jensen–Shannon divergence: symmetric, bounded variant of KL; its square root is a metric.
- Hellinger distance: the square-root transform gives symmetry and the triangle inequality (a short worked example follows this list).
3. Optimal-transport based #
- Wasserstein distance: accounts for both location and shape differences; popular for generative-model evaluation and drift detection.
- Sinkhorn distance: entropically regularised optimal transport that trades a little accuracy for much faster computation (see the sketch after this list).
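To make the distribution measures concrete, here is a minimal sketch that compares two small probability vectors with KL, Jensen–Shannon, and Hellinger. The vectors and the epsilon smoothing are illustrative choices, not taken from any particular dataset; SciPy provides rel_entr and jensenshannon, while Hellinger is short enough to write by hand.

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.special import rel_entr

p = np.array([0.1, 0.4, 0.5, 0.0])
q = np.array([0.3, 0.3, 0.0, 0.4])

# KL(p || q) is infinite wherever p > 0 but q = 0, so a small epsilon
# plus renormalisation is a common (if crude) workaround.
eps = 1e-9
p_s = (p + eps) / (p + eps).sum()
q_s = (q + eps) / (q + eps).sum()

kl_pq = rel_entr(p_s, q_s).sum()  # D_KL(p || q); note the asymmetry
kl_qp = rel_entr(q_s, p_s).sum()  # D_KL(q || p) generally differs
js = jensenshannon(p, q, base=2)  # already the square root, hence a metric
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(f"KL(p||q) = {kl_pq:.3f}, KL(q||p) = {kl_qp:.3f}")
print(f"Jensen-Shannon distance = {js:.3f}")
print(f"Hellinger distance = {hellinger:.3f}")

With these vectors both KL directions blow up (driven by the smoothed zeros) and differ from each other, while Jensen–Shannon and Hellinger stay bounded.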
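For the optimal-transport measures, the sketch below uses SciPy's exact 1-D wasserstein_distance on two synthetic samples and then runs a handful of Sinkhorn scaling iterations on a tiny discrete problem. The Sinkhorn loop is a toy implementation meant only to show the idea of entropic regularisation; libraries such as POT provide tuned solvers.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Two synthetic 1-D samples whose location and spread differ slightly.
x = rng.normal(loc=0.0, scale=1.0, size=1_000)
y = rng.normal(loc=0.5, scale=1.2, size=1_000)
print(f"1-D Wasserstein-1: {wasserstein_distance(x, y):.3f}")

# Toy Sinkhorn iteration on a 3-bin discrete transport problem.
p = np.array([0.2, 0.5, 0.3])  # source histogram
q = np.array([0.4, 0.4, 0.2])  # target histogram
cost = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
eps = 0.1  # entropic regularisation strength
K = np.exp(-cost / eps)

u = np.ones_like(p)
for _ in range(200):  # alternating scaling updates
    v = q / (K.T @ u)
    u = p / (K @ v)

plan = u[:, None] * K * v[None, :]  # approximate transport plan
print(f"Sinkhorn transport cost: {np.sum(plan * cost):.3f}")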
Comparing vector distances #
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Five toy 3-dimensional embeddings.
embeddings = np.array(
    [
        [0.2, 0.4, 0.4],
        [0.1, 0.9, 0.0],
        [0.6, 0.2, 0.2],
        [0.3, 0.1, 0.6],
        [0.05, 0.45, 0.5],
    ]
)
labels = ["A", "B", "C", "D", "E"]

# Pairwise distance matrices under the two measures.
cos_dist = cosine_distances(embeddings)
euc_dist = euclidean_distances(embeddings)

# constrained_layout keeps the shared colorbar from overlapping the axes.
fig, axes = plt.subplots(1, 2, figsize=(7.5, 3.3), constrained_layout=True)
for ax, matrix, title in zip(
    axes,
    (cos_dist, euc_dist),
    ("Cosine distance", "Euclidean distance"),
):
    im = ax.imshow(matrix, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    ax.set_title(title)
    # Annotate each cell with its distance value.
    for i in range(len(labels)):
        for j in range(len(labels)):
            ax.text(j, i, f"{matrix[i, j]:.2f}", ha="center", va="center", color="white")
# Shared colorbar (uses the scale of the last image drawn).
fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.9, label="Distance")
plt.show()

The notion of “nearest neighbour” depends on the distance you choose. Cosine distance compares direction only, so A and E, which point in almost the same direction, are nearly indistinguishable despite their different coordinates, while Euclidean distance also penalises differences in magnitude, which pushes B, the longest vector, away from all of its neighbours.
Choosing a distance #
- Check feature scaling: standardise or normalise when magnitudes differ across features (see the sketch after this list).
- Consider sparsity: sparse text or recommender data often works better with cosine distance.
- Identify distributional needs: use KL, Jensen–Shannon, or Hellinger for probability vectors, and mind support mismatches.
- Assess shape versus location: Wasserstein captures both shifts and spread, which is valuable for generative models and drift detection.
- Balance accuracy and cost: high-dimensional or large datasets may call for approximations such as LSH or Sinkhorn.
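As a quick illustration of the first point, this sketch uses made-up feature values whose scales differ by orders of magnitude; standardising before computing Euclidean distances stops the large-scale feature from dominating.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: feature 0 is in the thousands, feature 1 is a small ratio.
X = np.array(
    [
        [1200.0, 0.30],
        [1250.0, 0.90],
        [4000.0, 0.32],
    ]
)

raw = euclidean_distances(X)
scaled = euclidean_distances(StandardScaler().fit_transform(X))

# Unscaled, rows 0 and 1 look almost identical because feature 0 dwarfs feature 1;
# after standardisation the gap in feature 1 is no longer hidden.
print(np.round(raw, 2))
print(np.round(scaled, 2))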
Quick reference #
| Category | Measure | Typical use | Notes |
|---|---|---|---|
| Vector | Cosine similarity | Text embeddings, TF-IDF, sparse vectors | Handle zero vectors carefully |
| Vector | Euclidean / L1 / L∞ | Clustering on continuous features | Feature scaling is critical |
| Distribution | KL divergence | Compare model vs. data distributions | Asymmetric; infinite where p has mass but q does not |
| Distribution | Jensen–Shannon divergence | Symmetric comparison of probability vectors | Its square root is a metric |
| Distribution | Hellinger distance | Bayesian updates, drift monitoring | Bin/normalisation choices matter |
| Optimal transport | Wasserstein distance | Generative model evaluation, anomaly detection | Computationally heavy; consider Sinkhorn |
Checklist #
- Clarified whether the inputs are vectors or distributions
- Verified the assumptions (symmetry, triangle inequality) required by the downstream algorithm
- Evaluated the impact of normalisation or dimensionality reduction
- Considered approximate methods when exact distance is too expensive
- Visualised distance changes to ensure they align with intuition