Cosine similarity and distance

Summary
  • Cosine similarity measures the closeness of vectors via the angle between them.
  • Compute cosine similarity/distance in Python for embeddings or sparse TF‑IDF vectors.
  • Review normalisation, zero vectors, and other practical considerations.

1. Definition and intuition

For vectors \(\mathbf{a}, \mathbf{b}\):

$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$

  • Close to 1: pointing in the same direction (high similarity).
  • Around 0: orthogonal (unrelated).
  • Close to –1: opposite direction.
  • Distance version: \(d = 1 - \cos(\theta)\), which ranges from 0 (same direction) to 2 (opposite direction).

Because the magnitude is normalised out, cosine focuses on direction rather than length.
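
The formula above can be checked directly with NumPy. This is a minimal sketch rather than a reference implementation; the helper name cosine_similarity_1d is hypothetical and assumes two non-zero 1-D vectors of equal length.

```python
import numpy as np


def cosine_similarity_1d(a, b):
    """Cosine of the angle between two non-zero 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Dot product divided by the product of the Euclidean norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])

same = cosine_similarity_1d(a, 2 * a)                 # same direction, different length
orth = cosine_similarity_1d([1.0, 0.0], [0.0, 1.0])   # orthogonal vectors
opp = cosine_similarity_1d(a, -a)                     # opposite direction

print("similarities:", same, orth, opp)
print("distances:   ", 1.0 - same, 1.0 - orth, 1.0 - opp)
```

The three cases correspond to the bullets above: similarity near 1, 0, and -1, with distances near 0, 1, and 2.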


2. Python example

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

# Three toy embedding vectors, one per row.
embeddings = np.array(
    [
        [0.1, 0.4, 0.5],
        [0.2, 0.2, 0.6],
        [0.6, 0.3, 0.1],
    ]
)

# Pairwise (3, 3) matrices; each distance is 1 minus the matching similarity.
sim_matrix = cosine_similarity(embeddings)
dist_matrix = cosine_distances(embeddings)

print(sim_matrix.round(3))
print(dist_matrix.round(3))

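For a single pair, scipy.spatial.distance.cosine is convenient; note that it returns the cosine distance, so subtract it from 1 to get the similarity. scikit-learn's cosine_similarity also accepts sparse matrices directly, so TF-IDF matrices do not need to be converted to dense arrays first.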
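
As a rough illustration of both points, the sketch below compares one pair with SciPy and then passes a small sparse matrix to scikit-learn. The matrix values are made up and only stand in for TF-IDF rows; the alias cosine_distance_pair is an assumption introduced here for readability.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import cosine as cosine_distance_pair
from sklearn.metrics.pairwise import cosine_similarity

u = np.array([0.1, 0.4, 0.5])
v = np.array([0.2, 0.2, 0.6])

# SciPy returns the cosine *distance* for one pair of 1-D vectors.
pair_distance = cosine_distance_pair(u, v)
pair_similarity = 1.0 - pair_distance

# Toy sparse matrix standing in for TF-IDF rows (values are illustrative only).
tfidf = csr_matrix(
    np.array(
        [
            [0.0, 0.7, 0.0, 0.3],
            [0.0, 0.5, 0.0, 0.5],
            [0.9, 0.0, 0.1, 0.0],
        ]
    )
)

# Sparse input is accepted as-is; the result is a dense array by default.
sparse_sim = cosine_similarity(tfidf)

print(round(pair_similarity, 3))
print(sparse_sim.round(3))
```

Rows 0 and 2 share no non-zero column, so their similarity is 0.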


3. Key characteristics

  • Scale invariance: great for TF-IDF or embedding vectors with different magnitudes.
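  • Sparse-friendly: only dimensions where both vectors are non-zero contribute to the dot product, so high-dimensional sparse vectors are cheap to compare.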
  • Negative features: interpret carefully; centring or normalisation may be required.
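
The scale-invariance and centring points can be illustrated with a short sketch. The vectors are arbitrary examples, cos_sim is a hypothetical helper, and subtracting each vector's own mean is only one possible centring choice (it makes the result equal to the Pearson correlation).

```python
import numpy as np


def cos_sim(a, b):
    """Cosine similarity of two non-zero 1-D arrays (hypothetical helper)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

# Rescaling whole vectors by positive factors leaves the cosine unchanged.
base = cos_sim(a, b)
scaled = cos_sim(10.0 * a, 0.5 * b)

# All-positive vectors always give a positive cosine, even for opposite trends.
x = np.array([5.0, 6.0, 7.0])   # increasing
y = np.array([7.0, 6.0, 5.0])   # decreasing
raw = cos_sim(x, y)

# Centring (assumed choice: subtract each vector's mean) exposes the opposite trend.
centred = cos_sim(x - x.mean(), y - y.mean())

print(round(base, 4), round(scaled, 4))
print(round(raw, 4), round(centred, 4))
```

Here the raw similarity is high because both vectors sit in the positive orthant, while the centred similarity is negative, which is why the sign of a cosine score needs careful interpretation.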

4. Applications

  • Search & recommendation: rank items by cosine similarity to a query or user profile.
  • Clustering topics: group texts or topics with spherical k-means, or run standard k-means on L2-normalised vectors, where squared Euclidean distance equals \(2(1 - \cos(\theta))\).
  • Embedding evaluation: compare similarity distributions between positive/negative pairs.
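
The search and recommendation use case can be sketched as a ranking of item vectors against a query vector. The function rank_by_cosine and the toy data are hypothetical, and the sketch assumes no item or query is a zero vector.

```python
import numpy as np


def rank_by_cosine(query, items):
    """Return item indices sorted by descending cosine similarity to the query.

    Hypothetical helper; assumes every row of `items` and `query` is non-zero.
    """
    items = np.asarray(items, dtype=float)
    query = np.asarray(query, dtype=float)
    item_norms = np.linalg.norm(items, axis=1)
    scores = items @ query / (item_norms * np.linalg.norm(query))
    order = np.argsort(-scores)  # highest similarity first
    return order, scores[order]


items = np.array(
    [
        [0.1, 0.4, 0.5],
        [0.2, 0.2, 0.6],
        [0.6, 0.3, 0.1],
    ]
)
query = np.array([0.6, 0.3, 0.1])  # identical to item 2 in this toy example

order, scores = rank_by_cosine(query, items)
print(order)
print(scores.round(3))
```

For large collections, normalising the item matrix once and reusing it turns each query into a single matrix-vector product.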

5. Practical notes

  • Cosine similarity is undefined for zero vectors because the denominator becomes zero; drop such vectors or add a small epsilon to the denominator.
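  • Cosine distance does not satisfy the triangle inequality in general, so it is not a true metric; if a metric-based algorithm requires it, consider the angular distance \(\arccos(\cos(\theta)) / \pi\) or Euclidean distance on L2-normalised vectors.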
  • Cosine is invariant to rescaling a whole vector but not to rescaling individual features; combine it with standardisation or dimensionality reduction when a few large-scale features dominate the angle.
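
The zero-vector and triangle-inequality notes can be made concrete with a small sketch. The helper names are hypothetical, and returning 0.0 for a zero vector is an assumed convention, not the only reasonable one.

```python
import numpy as np


def safe_cosine_similarity(a, b, eps=1e-12):
    """Cosine similarity that returns 0.0 when either vector is (near) zero.

    Hypothetical helper; the 0.0 fallback is an assumed convention.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:
        return 0.0
    # Clip to guard against tiny floating-point overshoot outside [-1, 1].
    return float(np.clip(a @ b / denom, -1.0, 1.0))


def cosine_distance(a, b):
    return 1.0 - safe_cosine_similarity(a, b)


def angular_distance(a, b):
    # Angle between the vectors, rescaled to the range [0, 1].
    return float(np.arccos(safe_cosine_similarity(a, b)) / np.pi)


zero_case = safe_cosine_similarity([0.0, 0.0], [1.0, 2.0])

# Counterexample: b lies halfway between a and c (45 degrees from each).
a, b, c = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
d_ab, d_bc, d_ac = cosine_distance(a, b), cosine_distance(b, c), cosine_distance(a, c)
violates_triangle = d_ac > d_ab + d_bc

ang_ok = angular_distance(a, c) <= angular_distance(a, b) + angular_distance(b, c) + 1e-9

print(zero_case, violates_triangle, ang_ok)
```

In this example the direct cosine distance from a to c is larger than the detour through b, while the angular distance satisfies the inequality with equality.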

Cosine similarity is a simple yet powerful measure for directional comparison; treat zero vectors and metric assumptions with care when plugging it into downstream pipelines.