Summary
- Cosine similarity measures the closeness of vectors via the angle between them.
- Compute cosine similarity/distance in Python for embeddings or sparse TF‑IDF vectors.
- Review normalisation, zero vectors, and other practical considerations.
1. Definition and intuition
For vectors \(\mathbf{a}, \mathbf{b}\):
$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$
- Close to 1: pointing in the same direction (high similarity).
- Around 0: orthogonal (unrelated).
- Close to –1: opposite direction.
- Distance version: \(d = 1 - \cos(\theta)\).
Because the magnitude is normalised out, cosine focuses on direction rather than length.
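The definition above can be sketched directly with NumPy (a minimal illustration; the vectors are arbitrary toy values):

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction, twice the length

print(cosine(a, b))    # ≈ 1.0: direction matches, magnitude is ignored
print(cosine(a, -b))   # ≈ -1.0: opposite direction
```

Doubling the length of `b` leaves the result unchanged, which is exactly the scale invariance that makes cosine a directional measure.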
2. Python example
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances

embeddings = np.array(
    [
        [0.1, 0.4, 0.5],
        [0.2, 0.2, 0.6],
        [0.6, 0.3, 0.1],
    ]
)

sim_matrix = cosine_similarity(embeddings)    # pairwise similarities
dist_matrix = cosine_distances(embeddings)    # 1 - similarity

print(sim_matrix.round(3))
print(dist_matrix.round(3))
```
For a single pair, scipy.spatial.distance.cosine is convenient, though note it returns the distance (1 − similarity), not the similarity. scikit-learn's cosine_similarity also works efficiently on sparse matrices.
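A quick sketch of the single-pair case, reusing the first two rows of the embeddings above (remember that SciPy returns the distance, not the similarity):

```python
from scipy.spatial.distance import cosine  # returns the *distance*

u = [0.1, 0.4, 0.5]
v = [0.2, 0.2, 0.6]

d = cosine(u, v)          # cosine distance = 1 - similarity
print(round(1 - d, 3))    # → 0.93, the off-diagonal entry sim_matrix[0, 1]
```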
3. Key characteristics
- Scale invariance: great for TF-IDF or embedding vectors with different magnitudes.
- Sparse-friendly: robust when most entries are zero.
- Negative features: interpret carefully; centring or normalisation may be required.
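As a sketch of the sparse-friendly point: cosine_similarity accepts the CSR matrix produced by TfidfVectorizer directly, with no densifying step (the three documents are made-up examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs play outside",
]

X = TfidfVectorizer().fit_transform(docs)   # sparse CSR matrix
sim = cosine_similarity(X)                  # operates on the sparse input directly

print(sim.round(3))   # docs 0 and 1 share words; doc 2 shares none with doc 0
```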
4. Applications
- Search & recommendation: rank items by cosine similarity to a query or user profile.
- Clustering topics: apply cosine distance in k-means (or spherical k-means) for text/topic grouping.
- Embedding evaluation: compare similarity distributions between positive/negative pairs.
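The search-and-recommendation use case reduces to scoring items against a query and sorting. A minimal sketch with made-up 2-D embeddings:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# toy item embeddings and a query vector (illustrative values only)
items = np.array([[0.9, 0.1],
                  [0.1, 0.9],
                  [0.7, 0.7]])
query = np.array([[1.0, 0.0]])

scores = cosine_similarity(query, items)[0]
ranking = np.argsort(scores)[::-1]   # best match first

print(ranking)   # → [0 2 1]: item 0 is most aligned with the query
```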
5. Practical notes
- Cosine similarity is undefined for zero vectors; drop them beforehand or add a small epsilon to the norms in the denominator.
- Cosine distance is not a true metric (it can violate the triangle inequality); check this before using it with algorithms that assume one.
- Combine with standardisation or dimensionality reduction when angles are sensitive to feature scaling.
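A guard for the zero-vector case in the first note might look like this (safe_cosine and the similarity-0 convention are this sketch's own choices, not a library API):

```python
import numpy as np

def safe_cosine(a, b, eps=1e-12):
    """Cosine similarity that treats (near-)zero vectors as similarity 0."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na < eps or nb < eps:
        return 0.0   # convention chosen here: a zero vector matches nothing
    return float(np.dot(a, b) / (na * nb))

print(safe_cosine([0.0, 0.0], [1.0, 2.0]))   # → 0.0, no division-by-zero
print(safe_cosine([1.0, 0.0], [2.0, 0.0]))   # → 1.0
```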
Cosine similarity is a simple yet powerful measure for directional comparison; treat zero vectors and metric assumptions with care when plugging it into downstream pipelines.