4.5.3 Recall@k and Precision@k
Summary
- Recall@k measures the proportion of relevant items included within the top-k results.
- We compute Recall@k and Precision@k from a recommendation list to evaluate its coverage and purity.
- Interpretation varies depending on the number of ground truths and candidate items; we review key design considerations.
1. Definition #
For a query \(q\) with a set of relevant items \(G_q\) and a top-k candidate set \(S_{q,k}\):
$$ \mathrm{Recall@k} = \frac{|G_q \cap S_{q,k}|}{|G_q|} $$

$$ \mathrm{Precision@k} = \frac{|G_q \cap S_{q,k}|}{k} $$

- Recall@k: How many of the relevant items were retrieved.
- Precision@k: How many of the retrieved items were actually relevant.
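As a quick sanity check with made-up numbers: if a query has \(|G_q| = 4\) relevant items and 2 of them appear in the top-5 list, then

$$ \mathrm{Recall@5} = \frac{2}{4} = 0.5, \qquad \mathrm{Precision@5} = \frac{2}{5} = 0.4 $$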
2. Python Implementation #
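A minimal NumPy sketch for a single query (the helper name `recall_precision_at_k` and the choice to rank by `argsort` are illustrative, not a standard API):

```python
import numpy as np

def recall_precision_at_k(y_true, y_score, k):
    """Compute Recall@k and Precision@k for a single query.

    y_true  : binary relevance labels (0/1), one per candidate item
    y_score : model-predicted scores, one per candidate item
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Indices of the k highest-scoring candidates
    top_k = np.argsort(y_score)[::-1][:k]
    n_hits = int(y_true[top_k].sum())            # relevant items inside the top-k
    recall = n_hits / max(int(y_true.sum()), 1)  # guard against queries with no relevant items
    precision = n_hits / k
    return recall, precision

# Example: items ranked by score -> indices 0, 1, 3 make up the top-3
recall, precision = recall_precision_at_k(
    y_true=[1, 0, 1, 0, 0],
    y_score=[0.9, 0.8, 0.2, 0.7, 0.1],
    k=3,
)
```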
`y_true` contains binary relevance labels (0/1), and `y_score` contains the model's predicted scores.
For multiple queries, compute the mean over all samples.
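Averaging over queries can be sketched as follows (the two-query dataset below is made up for illustration):

```python
import numpy as np

def recall_at_k(y_true, y_score, k):
    # Fraction of a query's relevant items found in its top-k
    top_k = np.argsort(y_score)[::-1][:k]
    return int(np.asarray(y_true)[top_k].sum()) / max(int(np.sum(y_true)), 1)

# One (y_true, y_score) pair per query
queries = [
    ([1, 0, 1, 0], [0.9, 0.1, 0.8, 0.3]),  # both relevant items in the top-2
    ([0, 1, 0, 1], [0.2, 0.4, 0.9, 0.1]),  # one of two relevant items in the top-2
]
mean_recall = np.mean([recall_at_k(t, s, k=2) for t, s in queries])
```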
3. Choosing k #
- Set k according to UI or serving constraints (e.g., top-5 recommendations → Recall@5).
- Evaluate multiple cutoffs (Recall@5, Recall@10, etc.) to analyze coverage trends.
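Evaluating several cutoffs on one ranked list can be sketched like this (the labels and scores are made up; recall grows monotonically as k increases):

```python
import numpy as np

def recall_at_k(y_true, y_score, k):
    # Fraction of relevant items captured in the top-k
    top_k = np.argsort(y_score)[::-1][:k]
    return int(np.asarray(y_true)[top_k].sum()) / max(int(np.sum(y_true)), 1)

y_true = [1, 0, 1, 0, 1, 0, 0, 0]   # 3 relevant items
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

for k in (1, 3, 5):
    print(f"Recall@{k} = {recall_at_k(y_true, y_score, k):.2f}")
```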
4. Practical Applications #
- Recommendation systems: Check whether the items a user actually selects appear in the recommendation list.
- Advertising: Measure how many clicked or converted ads were included in the top-k impressions.
- A/B testing: Track Recall@k alongside online metrics to confirm whether offline improvements translate to user behavior.
5. Trade-off with Precision@k #
- Increasing k typically raises Recall@k but lowers Precision@k, since more candidates are retrieved.
- The ideal model maximizes both under a fixed k.
- Use F1@k or MAP to balance recall and precision when evaluating ranking models.
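F1@k is the harmonic mean of Recall@k and Precision@k at the same cutoff; a minimal sketch (the helper name `f1_at_k` is an illustrative choice):

```python
def f1_at_k(recall, precision):
    # Harmonic mean of Recall@k and Precision@k computed at the same cutoff k
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# Example: Recall@3 = 0.5 and Precision@3 = 1/3 combine to F1@3 = 0.4
score = f1_at_k(0.5, 1 / 3)
```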
Summary #
- Recall@k measures coverage, while Precision@k measures purity.
- Define k clearly and evaluate multiple metrics to capture ranking quality comprehensively.
- Relating Recall@k and Precision@k to online KPIs (e.g., CTR, CVR) helps quantify business impact.