4.4.4
Cosine Similarity and Distance | Comparing vector directions
- Cosine similarity measures the closeness of vectors via the angle between them.
- Compute cosine similarity/distance in Python for embeddings or sparse TF‑IDF vectors.
- Review normalisation, zero vectors, and other practical considerations.
1. Definition and intuition #
For vectors \(\mathbf{a}, \mathbf{b}\):
$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$

- Close to 1: pointing in the same direction (high similarity).
- Around 0: orthogonal (unrelated).
- Close to –1: opposite direction.
- Distance version: \(d = 1 - \cos(\theta)\).
Because the magnitude is normalised out, cosine focuses on direction rather than length.
2. Python example #
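A minimal sketch of both routes, assuming NumPy and SciPy are available; note that `scipy.spatial.distance.cosine` returns the *distance* \(1 - \cos(\theta)\), so it must be subtracted from 1 to recover the similarity:

```python
import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Cosine similarity from the definition: dot product over the product of norms
cos_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# SciPy returns the distance 1 - cos(theta); convert back to similarity
cos_sim_scipy = 1 - cosine(a, b)

print(round(cos_sim, 4))                   # ~0.9746: nearly the same direction
print(np.isclose(cos_sim, cos_sim_scipy))  # True: both routes agree
```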
For a single pair, `scipy.spatial.distance.cosine` is convenient; note that it returns the cosine distance \(1 - \cos(\theta)\), not the similarity. For sparse matrices, scikit-learn's `cosine_similarity` computes pairwise similarities efficiently.
3. Key characteristics #
- Scale invariance: great for TF-IDF or embedding vectors with different magnitudes.
- Sparse-friendly: robust when most entries are zero.
- Negative features: interpret carefully; centring or normalisation may be required.
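The scale-invariance point can be checked directly: rescaling either vector by a positive factor leaves the cosine unchanged. A small sketch, assuming NumPy (the helper `cosine_sim` is illustrative, not a library function):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity straight from the definition."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 2.0])
b = np.array([2.0, 1.0, 0.0])

# Multiplying either vector by a positive scalar does not change the angle
print(np.isclose(cosine_sim(a, b), cosine_sim(100 * a, b)))   # True
print(np.isclose(cosine_sim(a, b), cosine_sim(a, 0.01 * b)))  # True
```

This is exactly why raw term-frequency vectors of very different lengths can still score as highly similar.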
4. Applications #
- Search & recommendation: rank items by cosine similarity to a query or user profile.
- Clustering topics: apply cosine distance in k-means (or spherical k-means) for text/topic grouping.
- Embedding evaluation: compare similarity distributions between positive/negative pairs.
5. Practical notes #
- Cosine similarity is undefined for zero vectors—drop them or add a small epsilon.
- Cosine distance \(1 - \cos(\theta)\) does not satisfy the triangle inequality in general; verify this before using it with an algorithm that assumes a proper metric.
- Combine with standardisation or dimensionality reduction when angles are sensitive to feature scaling.
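The zero-vector caveat above can be handled with an explicit guard. One possible convention (an illustrative choice, not a standard) is to return 0.0 when either vector has near-zero norm:

```python
import numpy as np

def safe_cosine_sim(a, b, eps=1e-12):
    """Cosine similarity that avoids dividing by zero norms."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na < eps or nb < eps:
        # Convention (an assumption, not standard): a zero vector has no
        # direction, so treat it as dissimilar to everything.
        return 0.0
    return float(a @ b / (na * nb))

print(safe_cosine_sim(np.zeros(3), np.array([1.0, 2.0, 3.0])))       # 0.0
print(safe_cosine_sim(np.array([1.0, 1.0]), np.array([1.0, 1.0])))   # ≈1.0
```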
Cosine similarity is a simple yet powerful measure for directional comparison; treat zero vectors and metric assumptions with care when plugging it into downstream pipelines.
FAQ #
What is cosine similarity? #
Cosine similarity measures the cosine of the angle between two vectors. It ranges from −1 to 1:
$$ \cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \, \|\mathbf{b}\|} $$

A value near 1 means the vectors point in nearly the same direction (highly similar); near 0 means they are orthogonal (unrelated); near −1 means they point in opposite directions. Because it ignores magnitude, it is especially useful when you care about direction (topic, meaning, orientation) rather than scale.
What are the main use cases of cosine similarity? #
- Information retrieval: rank documents by similarity to a query using TF-IDF or embedding vectors.
- Recommendation systems: find items or users whose preference vectors are most aligned.
- NLP / semantic search: compare sentence embeddings from models like BERT or OpenAI embeddings.
- Clustering: use cosine distance in spherical k-means or agglomerative clustering for text data.
- Duplicate detection: identify near-duplicate documents or product listings.
- Anomaly detection: flag items whose vector direction deviates sharply from the norm.
What is the difference between cosine similarity and Euclidean distance? #
Euclidean distance measures absolute spatial separation; cosine similarity measures angular separation (ignoring magnitude). For normalised vectors (unit length), they are monotonically related — but on raw vectors they behave differently:
- A short and a long document covering the same topic have high cosine similarity even on raw term-frequency vectors, but low Euclidean distance only after length normalisation.
- Choose Euclidean distance when magnitude matters (e.g., signal intensity). Choose cosine similarity when only direction (topic, sentiment, orientation) matters.
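The monotonic relationship on unit vectors can be verified numerically: for unit-length vectors, the squared Euclidean distance equals \(2(1 - \cos\theta)\). A quick check, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=5)
b = rng.normal(size=5)

# Normalise both vectors to unit length
a /= np.linalg.norm(a)
b /= np.linalg.norm(b)

cos_sim = a @ b
euclid_sq = np.sum((a - b) ** 2)

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(theta))
print(np.isclose(euclid_sq, 2 * (1 - cos_sim)))  # True
```

This is why ranking by Euclidean distance and ranking by cosine similarity give identical orderings once vectors are normalised.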
How do I compute cosine similarity in Python? #
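A sketch using scikit-learn's `cosine_similarity` on a sparse matrix (assuming scikit-learn and SciPy are installed; the toy document-term matrix is made up for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# A tiny sparse document-term matrix: 3 documents, 4 terms
X = csr_matrix(np.array([
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],   # same direction as doc 0, twice the length
    [0.0, 0.0, 1.0, 1.0],   # no terms in common with doc 0
]))

sims = cosine_similarity(X)  # pairwise 3x3 similarity matrix
print(np.round(sims, 3))
# doc 0 vs doc 1 -> 1.0 (same direction); doc 0 vs doc 2 -> 0.0 (orthogonal)
```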
For sparse TF-IDF matrices, sklearn.metrics.pairwise.cosine_similarity handles scipy.sparse matrices efficiently.
Why can cosine similarity be negative, and is that a problem? #
Cosine similarity is negative when the angle between vectors exceeds 90°. In NLP with TF-IDF or count vectors (all non-negative entries), scores stay between 0 and 1. With dense embeddings that include negative components (e.g., word2vec, GloVe), negative similarity is possible and simply means the vectors represent opposite concepts or directions. It is not a problem — it carries valid information. If your algorithm requires non-negative similarities, apply max(0, cosine_similarity) or rescale to the 0–1 range.