4.4.4 Cosine similarity and distance
Summary
- Cosine similarity measures the closeness of vectors via the angle between them.
- Compute cosine similarity/distance in Python for embeddings or sparse TF‑IDF vectors.
- Review normalisation, zero vectors, and other practical considerations.
1. Definition and intuition #
For vectors \(\mathbf{a}, \mathbf{b}\):
$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$
- Close to 1: pointing in the same direction (high similarity).
- Around 0: orthogonal (unrelated).
- Close to –1: opposite direction.
- Distance version: \(d = 1 - \cos(\theta)\).
Because the magnitude is normalised out, cosine focuses on direction rather than length.
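A quick worked example: take \(\mathbf{a} = (1, 2)\) and \(\mathbf{b} = (2, 4)\), which point the same way but differ in length.
$$ \cos(\theta) = \frac{1 \cdot 2 + 2 \cdot 4}{\sqrt{1^2 + 2^2}\,\sqrt{2^2 + 4^2}} = \frac{10}{\sqrt{5} \cdot \sqrt{20}} = \frac{10}{10} = 1 $$
The similarity is exactly 1 despite the different magnitudes, because only the direction enters the formula.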
2. Python example #
For a single pair of vectors, scipy.spatial.distance.cosine is convenient; note that it returns the cosine distance, \(1 - \cos(\theta)\), not the similarity. For whole matrices, including sparse ones, scikit-learn's cosine_similarity computes all pairwise similarities efficiently.
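A minimal sketch of both routes (the toy vectors here are arbitrary illustrations):

```python
import numpy as np
from scipy.spatial.distance import cosine
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Single pair: scipy returns the cosine *distance*, so subtract from 1.
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # parallel to a
sim = 1.0 - cosine(a, b)       # ≈ 1.0 for parallel vectors

# Sparse matrix: pairwise similarities between all rows at once.
X = csr_matrix([[1.0, 0.0, 2.0],
                [0.0, 3.0, 0.0],
                [2.0, 0.0, 4.0]])
S = cosine_similarity(X)       # 3x3 matrix of row-vs-row similarities
print(sim)
print(S)
```

scikit-learn operates on the sparse matrix directly, without densifying it, which is what makes this route efficient for TF-IDF data.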
3. Key characteristics #
- Scale invariance: great for TF-IDF or embedding vectors with different magnitudes.
- Sparse-friendly: robust when most entries are zero.
- Negative features: interpret carefully; centring or normalisation may be required.
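Scale invariance in particular is easy to verify directly (the vectors below are illustrative):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity computed from the definition."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 0.0, 5.0])
b = np.array([3.0, 0.0, 1.0, 4.0])

# Rescaling either vector leaves the similarity unchanged.
same = np.isclose(cos_sim(a, b), cos_sim(10 * a, 0.5 * b))
print(same)  # True
```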
4. Applications #
- Search & recommendation: rank items by cosine similarity to a query or user profile.
- Clustering topics: apply cosine distance in k-means (or spherical k-means) for text/topic grouping.
- Embedding evaluation: compare similarity distributions between positive/negative pairs.
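A search-style ranking step of the kind described above might look like this sketch (the corpus and query are invented for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["the cat sat on the mat",
        "dogs and cats are pets",
        "stock markets fell sharply today"]
query = ["my cat is a pet"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # sparse TF-IDF matrix, one row per document
q = vec.transform(query)      # query projected into the same vocabulary

scores = cosine_similarity(q, X).ravel()
ranking = np.argsort(scores)[::-1]  # document indices, best match first
print(ranking)
```

Note that without stemming, "cats"/"pets" in the corpus do not match "cat"/"pet" in the query as identical tokens; only exact vocabulary overlap contributes to the score here.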
5. Practical notes #
- Cosine similarity is undefined for zero vectors—drop them or add a small epsilon.
- Cosine distance \(1 - \cos(\theta)\) does not satisfy the triangle inequality in general; verify this before using it with algorithms that assume a true metric.
- Combine with standardisation or dimensionality reduction when angles are sensitive to feature scaling.
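One defensive pattern for the zero-vector issue is a wrapper along these lines (the epsilon threshold and the choice to return 0.0 are placeholder conventions, not a standard):

```python
import numpy as np

def safe_cosine(a, b, eps=1e-12):
    """Cosine similarity that returns 0.0 for (near-)zero vectors
    instead of dividing by zero."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na < eps or nb < eps:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

zero = np.array([0.0, 0.0])
v = np.array([1.0, 2.0])
print(safe_cosine(zero, v))  # 0.0 instead of a NaN / ZeroDivisionError
print(safe_cosine(v, 2 * v))
```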
Cosine similarity is a simple yet powerful measure for directional comparison; treat zero vectors and metric assumptions with care when plugging it into downstream pipelines.