4.5.3 Recall@k and Precision@k
Summary
- Recall@k measures the proportion of relevant items included within the top-k results.
- We compute Recall@k and Precision@k from a recommendation list to evaluate its coverage and purity.
- Interpretation varies depending on the number of ground truths and candidate items; we review key design considerations.
1. Definition #
For a query \(q\) with a set of relevant items \(G_q\) and a top-k candidate set \(S_{q,k}\):
$$ \mathrm{Recall@k} = \frac{|G_q \cap S_{q,k}|}{|G_q|} $$

$$ \mathrm{Precision@k} = \frac{|G_q \cap S_{q,k}|}{k} $$

- Recall@k: How many of the relevant items were retrieved.
- Precision@k: How many of the retrieved items were actually relevant.
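As a quick sanity check with made-up numbers: if a query has \(|G_q| = 4\) relevant items and 2 of them appear in the top-5 list, then

$$ \mathrm{Recall@5} = \frac{2}{4} = 0.5, \qquad \mathrm{Precision@5} = \frac{2}{5} = 0.4 $$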
2. Python Implementation #
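A minimal NumPy sketch for a single query (the helper name `recall_precision_at_k` and the choice to rank by `argsort` are illustrative, not a standard API):

```python
import numpy as np

def recall_precision_at_k(y_true, y_score, k):
    """Compute Recall@k and Precision@k for a single query.

    y_true  : binary relevance labels (0/1), one per candidate item
    y_score : model-predicted scores, one per candidate item
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Indices of the k highest-scoring candidates
    top_k = np.argsort(y_score)[::-1][:k]
    n_hits = int(y_true[top_k].sum())            # relevant items inside the top-k
    recall = n_hits / max(int(y_true.sum()), 1)  # guard against queries with no relevant items
    precision = n_hits / k
    return recall, precision

# Example: items ranked by score -> indices 0, 1, 3 make up the top-3
recall, precision = recall_precision_at_k(
    y_true=[1, 0, 1, 0, 0],
    y_score=[0.9, 0.8, 0.2, 0.7, 0.1],
    k=3,
)
```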
`y_true` contains binary relevance labels (0/1), and `y_score` contains the model's predicted scores.
For multiple queries, compute the mean over all samples.
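Averaging over queries can be sketched as follows (the two-query dataset below is made up for illustration):

```python
import numpy as np

def recall_at_k(y_true, y_score, k):
    # Fraction of a query's relevant items found in its top-k
    top_k = np.argsort(y_score)[::-1][:k]
    return int(np.asarray(y_true)[top_k].sum()) / max(int(np.sum(y_true)), 1)

# One (y_true, y_score) pair per query
queries = [
    ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]),  # both relevant items in the top-2
    ([0, 1, 0, 1], [0.2, 0.4, 0.9, 0.1]),  # one of two relevant items in the top-2
]
mean_recall = np.mean([recall_at_k(t, s, k=2) for t, s in queries])
```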
3. Choosing k #
- Set k according to UI or serving constraints (e.g., top-5 recommendations → Recall@5).
- Evaluate multiple cutoffs (Recall@5, Recall@10, etc.) to analyze coverage trends.
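Evaluating several cutoffs on one ranked list can be sketched like this (the labels and scores are made up; recall grows monotonically as k increases):

```python
import numpy as np

def recall_at_k(y_true, y_score, k):
    # Fraction of relevant items captured in the top-k
    top_k = np.argsort(y_score)[::-1][:k]
    return int(np.asarray(y_true)[top_k].sum()) / max(int(np.sum(y_true)), 1)

y_true = [1, 0, 1, 0, 1, 0, 0, 0]   # 3 relevant items
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

for k in (1, 3, 5):
    print(f"Recall@{k} = {recall_at_k(y_true, y_score, k):.2f}")
```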
4. Practical Applications #
- Recommendation systems: Check whether the items a user actually selects appear in the recommendation list.
- Advertising: Measure how many clicked or converted ads were included in the top-k impressions.
- A/B testing: Track Recall@k alongside online metrics to confirm whether offline improvements translate to user behavior.
5. Trade-off with Precision@k #
- Increasing k typically raises Recall@k but lowers Precision@k, since more candidates are retrieved.
- The ideal model maximizes both under a fixed k.
- Use F1@k or MAP to balance recall and precision when evaluating ranking models.
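F1@k is the harmonic mean of Recall@k and Precision@k at the same cutoff; a minimal sketch (the helper name `f1_at_k` is an illustrative choice):

```python
def f1_at_k(recall, precision):
    # Harmonic mean of Recall@k and Precision@k computed at the same cutoff k
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Example: Recall@3 = 0.5 and Precision@3 = 1/3 combine to F1@3 = 0.4
score = f1_at_k(0.5, 1 / 3)
```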
Summary #
- Recall@k measures coverage, while Precision@k measures purity.
- Define k clearly and evaluate multiple metrics to capture ranking quality comprehensively.
- Relating Recall@k and Precision@k to online KPIs (e.g., CTR, CVR) helps quantify business impact.