Summary
  • Organise distance and similarity measures by typical use case.
  • Compare representative distances for vectors, probability distributions, and optimal transport with code examples.
  • Highlight the preprocessing and dimensionality-reduction considerations that influence distance behaviour.

Chapter 4 #

Distance and similarity measures #

Distances quantify how far apart two items are; similarities quantify how alike they are. They underpin clustering, recommendation, anomaly detection, and generative-model evaluation. Each distance assumes certain data properties, so matching the metric to the data (and the downstream algorithm) is essential.


Main categories #

1. Vector-space distances #

  • Euclidean distance: intuitive straight-line distance; highly sensitive to feature scaling.
  • Cosine similarity / distance: compares vector direction; ideal for TF-IDF or embedding spaces.
  • Manhattan / Chebyshev distances: L1 and L∞ norms; useful for sparse vectors or robust comparisons (a short sketch follows below).
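
The snippet below is a minimal sketch of how the L1, L2, and L∞ norms differ on a single pair of vectors; the vectors themselves are arbitrary toy values.

import numpy as np
from scipy.spatial.distance import chebyshev, cityblock, euclidean

u = np.array([0.2, 0.4, 0.4])
v = np.array([0.6, 0.2, 0.2])

print(f"L1 (Manhattan): {cityblock(u, v):.3f}")    # sum of absolute differences
print(f"L2 (Euclidean): {euclidean(u, v):.3f}")    # straight-line distance
print(f"L-inf (Chebyshev): {chebyshev(u, v):.3f}") # largest single-coordinate gap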

2. Probability-distribution distances #

  • Kullback–Leibler divergence (KL divergence): relative entropy; asymmetric and sensitive to zero probabilities.
  • Jensen–Shannon divergence: symmetric, always finite variant of KL; its square root is a metric.
  • Hellinger distance: built from square roots of probabilities, so it is symmetric and satisfies the triangle inequality. A sketch of all three measures follows below.
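
As a rough illustration, the sketch below evaluates the three measures on a pair of toy probability vectors; scipy.stats.entropy gives the KL divergence, and scipy.spatial.distance.jensenshannon returns the square root of the JS divergence.

import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

p = np.array([0.1, 0.4, 0.5])  # toy probability vectors (already normalised)
q = np.array([0.2, 0.5, 0.3])

kl_pq = entropy(p, q)              # KL(p || q)
kl_qp = entropy(q, p)              # KL(q || p) differs: KL is asymmetric
js_div = jensenshannon(p, q) ** 2  # jensenshannon returns the square root (a metric)
hellinger = np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2)

print(f"KL(p||q)={kl_pq:.3f}  KL(q||p)={kl_qp:.3f}  JS={js_div:.3f}  Hellinger={hellinger:.3f}")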

3. Optimal-transport based #

  • Wasserstein distance: accounts for both location and shape differences; popular for generative-model evaluation and drift detection.
  • Sinkhorn distance: entropy-regularised optimal transport for faster computation (a 1-D sketch follows below).
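
For one-dimensional samples, scipy provides an exact Wasserstein-1 distance; the sketch below compares a reference sample with a shifted, wider one (the sample parameters are illustrative). Higher-dimensional or Sinkhorn-regularised transport typically relies on a dedicated library such as POT.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=1_000)  # reference sample
drifted = rng.normal(loc=0.5, scale=1.2, size=1_000)   # shifted and wider

# Captures both the location shift and the change in spread.
print(f"Wasserstein-1: {wasserstein_distance(baseline, drifted):.3f}")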

Comparing vector distances #

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances

# Toy embedding vectors; in practice these would come from a model.
embeddings = np.array(
    [
        [0.2, 0.4, 0.4],
        [0.1, 0.9, 0.0],
        [0.6, 0.2, 0.2],
        [0.3, 0.1, 0.6],
        [0.05, 0.45, 0.5],
    ]
)
labels = ["A", "B", "C", "D", "E"]

# Pairwise distance matrices under the two metrics.
cos_dist = cosine_distances(embeddings)
euc_dist = euclidean_distances(embeddings)

# constrained_layout keeps the shared colorbar from overlapping the axes.
fig, axes = plt.subplots(1, 2, figsize=(7.5, 3.3), constrained_layout=True)
for ax, matrix, title in zip(
    axes,
    (cos_dist, euc_dist),
    ("Cosine distance", "Euclidean distance"),
):
    im = ax.imshow(matrix, cmap="viridis")
    ax.set_xticks(range(len(labels)))
    ax.set_yticks(range(len(labels)))
    ax.set_xticklabels(labels)
    ax.set_yticklabels(labels)
    ax.set_title(title)
    # Annotate each cell with its distance value.
    for i in range(len(labels)):
        for j in range(len(labels)):
            ax.text(j, i, f"{matrix[i, j]:.2f}", ha="center", va="center", color="white")
fig.colorbar(im, ax=axes.ravel().tolist(), shrink=0.9, label="Distance")
plt.show()
Heatmaps of cosine and Euclidean distance

The notion of “nearest neighbour” changes with the distance. Cosine focuses on direction (B and E are close), while Euclidean emphasises magnitude differences.


Choosing a distance #

  1. Check feature scaling
    Standardise or normalise when magnitudes differ across features (see the scaling sketch after this list).
  2. Consider sparsity
    Sparse text or recommender data often works better with cosine distance.
  3. Identify distributional needs
    Use KL/Jensen–Shannon/Hellinger for probability vectors; mind support mismatches.
  4. Assess shape versus location
    Wasserstein captures both shifts and spread—valuable for generative models and drift.
  5. Balance accuracy and cost
    High-dimensional or large datasets may require approximations (LSH, Sinkhorn).
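
As a small illustration of point 1, the sketch below standardises a toy matrix whose second feature has a much larger range; the feature values are made up purely for demonstration.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import StandardScaler

# The second feature has a much larger range and dominates raw Euclidean distance.
X = np.array([[0.1, 100.0], [0.9, 110.0], [0.2, 400.0]])

raw = euclidean_distances(X)
scaled = euclidean_distances(StandardScaler().fit_transform(X))

print(np.round(raw, 1))     # dominated by the large-range feature
print(np.round(scaled, 2))  # both features contribute after standardisation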

Quick reference #

| Category | Measure | Typical use | Notes |
| --- | --- | --- | --- |
| Vector | Cosine similarity | Text embeddings, TF-IDF, sparse vectors | Handle zero vectors carefully |
| Vector | Euclidean / L1 / L∞ | Clustering on continuous features | Feature scaling is critical |
| Distribution | KL divergence | Compare model vs. data distributions | Asymmetric; undefined with zero support |
| Distribution | Jensen–Shannon | Symmetric comparison of probability vectors | Square root becomes a metric |
| Distribution | Hellinger distance | Bayesian updates, drift monitoring | Bin/normalisation choices matter |
| Optimal transport | Wasserstein distance | Generative-model evaluation, anomaly detection | Computationally heavy; consider Sinkhorn |

Checklist #

  • Clarified whether the inputs are vectors or distributions
  • Verified the assumptions (symmetry, triangle inequality) required by the downstream algorithm
  • Evaluated the impact of normalisation or dimensionality reduction
  • Considered approximate methods when exact distance is too expensive
  • Visualised distance changes to ensure they align with intuition