Recall@k and Precision@k

TL;DR
  • Recall@k measures the proportion of relevant items included within the top-k results.
  • We compute Recall@k and Precision@k from a recommendation list to evaluate its coverage and purity.
  • Interpretation varies depending on the number of ground truths and candidate items; we review key design considerations.

1. Definition #

For a query \(q\) with a set of relevant items \(G_q\) and a top-k candidate set \(S_{q,k}\):

$$ \mathrm{Recall@k} = \frac{|G_q \cap S_{q,k}|}{|G_q|} $$ $$ \mathrm{Precision@k} = \frac{|G_q \cap S_{q,k}|}{k} $$

  • Recall@k: How many of the relevant items were retrieved.
  • Precision@k: How many of the retrieved items were actually relevant.
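A minimal worked example: suppose \(G_q = \{a, b, c\}\) and the top-5 list \(S_{q,5}\) contains \(a\) and \(c\) plus three irrelevant items. Then

$$ \mathrm{Recall@5} = \frac{|\{a, c\}|}{|\{a, b, c\}|} = \frac{2}{3}, \qquad \mathrm{Precision@5} = \frac{|\{a, c\}|}{5} = \frac{2}{5} $$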

2. Python Implementation #

import numpy as np

def recall_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Compute Recall@k — proportion of relevant items within the top-k."""
    if y_true.sum() == 0:
        return 0.0  # no relevant items: recall is undefined; return 0 by convention
    idx = np.argsort(-y_score)[:k]
    return float(y_true[idx].sum() / y_true.sum())

def precision_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Compute Precision@k — fraction of top-k items that are relevant."""
    idx = np.argsort(-y_score)[:k]
    return float(y_true[idx].sum() / k)

Here, y_true contains binary relevance labels (0/1) and y_score contains model-predicted scores.
For multiple queries, compute each metric per query and report the mean (macro-averaging).
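As a quick sanity check, macro-averaging over queries looks like this (a minimal sketch with made-up labels and scores; recall_at_k is repeated so the snippet runs standalone):

```python
import numpy as np

def recall_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Proportion of relevant items within the top-k."""
    idx = np.argsort(-y_score)[:k]
    return float(y_true[idx].sum() / y_true.sum())

# Two queries, each with (relevance labels, predicted scores).
queries = [
    (np.array([1, 1, 0, 0]), np.array([0.9, 0.1, 0.8, 0.2])),  # top-2 = {0, 2} → 1/2
    (np.array([0, 1, 0, 1]), np.array([0.1, 0.9, 0.2, 0.8])),  # top-2 = {1, 3} → 2/2
]

# Macro-averaged Recall@2 over the two queries.
mean_recall = float(np.mean([recall_at_k(t, s, 2) for t, s in queries]))
print(mean_recall)  # 0.75
```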


3. Choosing k #

  • Set k according to UI or serving constraints (e.g., top-5 recommendations → Recall@5).
  • Evaluate multiple cutoffs (Recall@5, Recall@10, etc.) to analyze coverage trends.
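Sweeping several cutoffs can be done in one pass over a ranked list (a sketch with invented labels; recall_at_k as defined above):

```python
import numpy as np

def recall_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    idx = np.argsort(-y_score)[:k]
    return float(y_true[idx].sum() / y_true.sum())

# Ten candidates; items 0, 2, and 5 are relevant.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

# Recall rises monotonically with k as more relevant items are covered.
recalls = {k: recall_at_k(y_true, y_score, k) for k in (1, 3, 5, 10)}
print(recalls)  # {1: 1/3, 3: 2/3, 5: 2/3, 10: 1.0}
```

The flat stretch between k=3 and k=5 shows where larger lists add no coverage — a useful signal when picking a serving cutoff.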

4. Practical Applications #

  • Recommendation systems: Check whether the items a user actually selects appear in the recommendation list.
  • Advertising: Measure how many clicked or converted ads were included in the top-k impressions.
  • A/B testing: Track Recall@k alongside online metrics to confirm whether offline improvements translate to user behavior.

5. Trade-off with Precision@k #

  • Increasing k typically raises Recall@k but lowers Precision@k, since more (mostly irrelevant) items enter the list.
  • The ideal model maximizes both under a fixed k.
  • Use F1@k or MAP to balance recall and precision when evaluating ranking models.
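F1@k is the harmonic mean of the two metrics at a fixed cutoff; a minimal sketch (hypothetical f1_at_k helper, not from the original code):

```python
import numpy as np

def f1_at_k(y_true: np.ndarray, y_score: np.ndarray, k: int) -> float:
    """Harmonic mean of Precision@k and Recall@k."""
    idx = np.argsort(-y_score)[:k]
    hits = float(y_true[idx].sum())
    recall = hits / y_true.sum()
    precision = hits / k
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

y_true = np.array([1, 0, 1, 0, 0])
y_score = np.array([0.9, 0.8, 0.7, 0.6, 0.5])
print(f1_at_k(y_true, y_score, 3))  # P=2/3, R=2/2 → F1 = 0.8
```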

Summary #

  • Recall@k measures coverage, while Precision@k measures purity.
  • Define k clearly and evaluate multiple metrics to capture ranking quality comprehensively.
  • Relating Recall@k and Precision@k to online KPIs (e.g., CTR, CVR) helps quantify business impact.