4.4.4
Cosine Similarity and Distance | Comparing vector directions
- Cosine similarity measures the closeness of vectors via the angle between them.
- Compute cosine similarity/distance in Python for embeddings or sparse TF‑IDF vectors.
- Review normalisation, zero vectors, and other practical considerations.
1. Definition and intuition #
For vectors \(\mathbf{a}, \mathbf{b}\):
$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$

- Close to 1: pointing in the same direction (high similarity).
- Around 0: orthogonal (unrelated).
- Close to –1: opposite direction.
- Distance version: \(d = 1 - \cos(\theta)\).
Because the magnitude is normalised out, cosine focuses on direction rather than length.
2. Python example #
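A minimal sketch of both routes, assuming NumPy and SciPy are available; note that `scipy.spatial.distance.cosine` returns the *distance* \(1 - \cos(\theta)\), so it must be subtracted from 1 to recover the similarity:

```python
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Cosine similarity from the definition: dot product over the product of norms
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# SciPy returns the distance 1 - cos(theta); convert back to similarity
cos_sim_scipy = 1 - cosine(a, b)

print(round(cos_sim, 4))                   # ~0.9746: nearly the same direction
print(np.isclose(cos_sim, cos_sim_scipy))  # True: both routes agree
```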
For a single pair, `scipy.spatial.distance.cosine` is convenient; note that it returns the cosine distance \(1 - \cos(\theta)\), not the similarity. For sparse matrices, scikit-learn's `cosine_similarity` computes pairwise similarities efficiently.
3. Key characteristics #
- Scale invariance: great for TF-IDF or embedding vectors with different magnitudes.
- Sparse-friendly: robust when most entries are zero.
- Negative features: interpret carefully; centring or normalisation may be required.
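The scale-invariance point can be checked directly: rescaling either vector by a positive factor leaves the cosine unchanged. A small sketch, assuming NumPy (the helper `cosine_sim` is illustrative, not a library function):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity straight from the definition."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 1.0, 0.0])

# Multiplying either vector by a positive scalar does not change the angle
print(np.isclose(cosine_sim(a, b), cosine_sim(100 * a, b)))   # True
print(np.isclose(cosine_sim(a, b), cosine_sim(a, 0.01 * b)))  # True
```

This is exactly why raw term-frequency vectors of very different lengths can still score as highly similar.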
4. Applications #
- Search & recommendation: rank items by cosine similarity to a query or user profile.
- Clustering topics: apply cosine distance in k-means (or spherical k-means) for text/topic grouping.
- Embedding evaluation: compare similarity distributions between positive/negative pairs.
5. Practical notes #
- Cosine similarity is undefined for zero vectors—drop them or add a small epsilon.
- Cosine distance \(1 - \cos(\theta)\) does not satisfy the triangle inequality in general; verify this before using it with an algorithm that assumes a proper metric.
- Combine with standardisation or dimensionality reduction when angles are sensitive to feature scaling.
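The zero-vector caveat above can be handled with an explicit guard. One possible convention (an illustrative choice, not a standard) is to return 0.0 when either vector has near-zero norm:

```python
import numpy as np

def safe_cosine_sim(a, b, eps=1e-12):
    """Cosine similarity that avoids dividing by zero norms."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na < eps or nb < eps:
        # Convention (an assumption, not standard): a zero vector has no
        # direction, so treat it as dissimilar to everything.
        return 0.0
    return float(a @ b / (na * nb))

print(safe_cosine_sim(np.zeros(3), np.array([1.0, 2.0, 3.0])))       # 0.0
print(safe_cosine_sim(np.array([1.0, 1.0]), np.array([1.0, 1.0])))   # ≈1.0
```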
Cosine similarity is a simple yet powerful measure for directional comparison; treat zero vectors and metric assumptions with care when plugging it into downstream pipelines.
FAQ #
What is cosine similarity? #
Cosine similarity measures the cosine of the angle between two vectors. It ranges from −1 to 1:
$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$

A value near 1 means the vectors point in nearly the same direction (highly similar); near 0 means they are orthogonal (unrelated); near −1 means they point in opposite directions. Because it ignores magnitude, it is especially useful when you care about direction (topic, meaning, orientation) rather than scale.
What are the main use cases of cosine similarity? #
- Information retrieval: rank documents by similarity to a query using TF-IDF or embedding vectors.
- Recommendation systems: find items or users whose preference vectors are most aligned.
- NLP / semantic search: compare sentence embeddings from models like BERT or OpenAI embeddings.
- Clustering: use cosine distance in spherical k-means or agglomerative clustering for text data.
- Duplicate detection: identify near-duplicate documents or product listings.
- Anomaly detection: flag items whose vector direction deviates sharply from the norm.
What is the difference between cosine similarity and Euclidean distance? #
Euclidean distance measures absolute spatial separation; cosine similarity measures angular separation (ignoring magnitude). For normalised vectors (unit length), they are monotonically related — but on raw vectors they behave differently:
- A short and a long document covering the same topic have high cosine similarity even on raw term-frequency vectors, but low Euclidean distance only after length normalisation.
- Choose Euclidean distance when magnitude matters (e.g., signal intensity). Choose cosine similarity when only direction (topic, sentiment, orientation) matters.
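The monotonic relationship on unit vectors can be verified numerically: for unit-length vectors, the squared Euclidean distance equals \(2(1 - \cos\theta)\). A quick check, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)

# Normalise both vectors to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = a @ b
euclid_sq = np.sum((a - b) ** 2)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(theta))
print(np.isclose(euclid_sq, 2 * (1 - cos_sim)))  # True
```

This is why ranking by Euclidean distance and ranking by cosine similarity give identical orderings once vectors are normalised.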
How do I compute cosine similarity in Python? #
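A sketch using scikit-learn's `cosine_similarity` on a sparse matrix (assuming scikit-learn and SciPy are installed; the toy document-term matrix is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# A tiny sparse document-term matrix: 3 documents, 4 terms
X = csr_matrix(np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as doc 0, twice the length
    [0.0, 0.0, 1.0, 1.0],   # no terms in common with doc 0
]))

sims = cosine_similarity(X)  # pairwise 3x3 similarity matrix
print(np.round(sims, 3))
# doc 0 vs doc 1 -> 1.0 (same direction); doc 0 vs doc 2 -> 0.0 (orthogonal)
```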
For sparse TF-IDF matrices, sklearn.metrics.pairwise.cosine_similarity handles scipy.sparse matrices efficiently.
Why can cosine similarity be negative, and is that a problem? #
Cosine similarity is negative when the angle between vectors exceeds 90°. In NLP with TF-IDF or count vectors (all non-negative entries), scores stay between 0 and 1. With dense embeddings that include negative components (e.g., word2vec, GloVe), negative similarity is possible and simply means the vectors represent opposite concepts or directions. It is not a problem — it carries valid information. If your algorithm requires non-negative similarities, apply max(0, cosine_similarity) or rescale to the 0–1 range.