Summary
- Classify ranking metrics and explain their roles in recommendation, search, and advertising systems.
- Compare representative metrics such as NDCG, MAP, Recall@k, and Hit Rate with a worked example.
- Outline how to pick a metric set that aligns with business KPIs and how to report improvements.
Overview of ranking evaluation #
Ranking models assign scores to items so that the “best” appear first. In recommendation and search, the quality of the top results drives engagement and revenue, so evaluation must focus on those top positions. This chapter introduces the main metric families and shows how they complement one another.
Metric categories #
1. List-wise metrics #
- NDCG / DCG: discounts gains logarithmically so that the top positions matter most.
- MAP (Mean Average Precision): averages the precision measured at each relevant item's position; ideal when multiple hits per query are expected.
2. Top-k hit metrics #
- Recall@k: fraction of relevant items captured within the top k.
- Hit Rate / Hit@k: whether at least one relevant item appears in the top k.
- Top-k Accuracy: for classifiers reused as top-k recommenders.
3. Pair-wise metrics #
- AUC (ranking): probability that a relevant item is ranked higher than an irrelevant one.
- Kendall's τ / Spearman correlation: compare entire ordering structures (both pair-wise metrics are sketched in code after this list).
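The pair-wise family is not covered by the worked example in the next section, so here is a minimal sketch of both metrics using scikit-learn's roc_auc_score and SciPy's kendalltau; the relevance grades and scores are made-up values for a single hypothetical query.

```python
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

# Hypothetical single query: graded relevance (0-3) and model scores.
relevance = np.array([3, 0, 2, 0, 1, 0, 0, 2])
scores = np.array([0.9, 0.7, 0.6, 0.5, 0.4, 0.35, 0.2, 0.1])

# Ranking AUC needs binary labels: treat any grade >= 1 as relevant.
# The result is the probability that a relevant item outscores an irrelevant one.
auc = roc_auc_score((relevance >= 1).astype(int), scores)

# Kendall's tau-b measures how closely the score ordering agrees with the
# ordering implied by the relevance grades (ties are handled by tau-b).
tau, _ = kendalltau(relevance, scores)

print(f"AUC = {auc:.3f}, Kendall's tau = {tau:.3f}")
```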
Comparing ranking metrics #
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import average_precision_score, ndcg_score

# Binary relevance labels for three users, ten candidate items each.
y_true = np.array(
    [
        [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    ]
)

# Model scores for the same users and items.
y_score = np.array(
    [
        [0.9, 0.3, 0.2, 0.1, 0.6, 0.05, 0.03, 0.02, 0.01, 0.005],
        [0.4, 0.8, 0.2, 0.1, 0.05, 0.6, 0.03, 0.02, 0.01, 0.005],
        [0.2, 0.1, 0.85, 0.05, 0.03, 0.02, 0.7, 0.01, 0.005, 0.003],
    ]
)


def recall_at_k(y_true_row: np.ndarray, y_score_row: np.ndarray, k: int) -> float:
    """Fraction of this user's relevant items that appear in the top k."""
    top_k_idx = np.argsort(y_score_row)[::-1][:k]
    positives = y_true_row.sum()
    if positives == 0:
        return 0.0
    return y_true_row[top_k_idx].sum() / positives


def hit_rate_at_k(y_true_row: np.ndarray, y_score_row: np.ndarray, k: int) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    top_k_idx = np.argsort(y_score_row)[::-1][:k]
    return float(y_true_row[top_k_idx].sum() > 0)


ks = [3, 5]
ndcg5 = ndcg_score(y_true, y_score, k=5)
# MAP: average precision per user, then mean across users.
map_score = np.mean(
    [average_precision_score(t, s) for t, s in zip(y_true, y_score)]
)
recalls = [np.mean([recall_at_k(t, s, k) for t, s in zip(y_true, y_score)]) for k in ks]
hits = [np.mean([hit_rate_at_k(t, s, k) for t, s in zip(y_true, y_score)]) for k in ks]

metrics = {
    "NDCG@5": ndcg5,
    "MAP": map_score,
    "Recall@3": recalls[0],
    "Recall@5": recalls[1],
    "Hit@3": hits[0],
    "Hit@5": hits[1],
}

fig, ax = plt.subplots(figsize=(6.5, 3.8))
ax.bar(metrics.keys(), metrics.values(), color="#f97316", alpha=0.85)
ax.set_ylim(0, 1.05)
ax.set_ylabel("Score")
ax.set_title("Ranking metrics on sample recommendations")
ax.grid(axis="y", alpha=0.3)
plt.xticks(rotation=25)
plt.tight_layout()
plt.show()
```

Each metric emphasises a different quality: Recall/Hit measure coverage, while MAP and NDCG account for ordering. Use both coverage and ordering metrics to assess improvements.
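The contrast becomes concrete with a small sketch (hypothetical scores, same binary-label format as above): two rankings that place both relevant items somewhere in the top 5 have identical Recall@5, yet NDCG@5 rewards the one that surfaces them earlier.

```python
import numpy as np
from sklearn.metrics import ndcg_score

# One query, two relevant items (indices 0 and 4).
labels = np.array([[1, 0, 0, 0, 1, 0, 0, 0, 0, 0]])

# Ranking A puts the relevant items at positions 1-2;
# ranking B pushes them down to positions 4-5.
scores_a = np.array([[0.9, 0.1, 0.2, 0.15, 0.8, 0.05, 0.04, 0.03, 0.02, 0.01]])
scores_b = np.array([[0.3, 0.9, 0.8, 0.7, 0.25, 0.05, 0.04, 0.03, 0.02, 0.01]])

for name, s in [("A", scores_a), ("B", scores_b)]:
    top5 = np.argsort(s[0])[::-1][:5]
    recall5 = labels[0][top5].sum() / labels[0].sum()  # identical for A and B
    print(name, "Recall@5 =", recall5, "NDCG@5 =", round(ndcg_score(labels, s, k=5), 3))
```

Both rankings reach Recall@5 of 1.0, but ranking B's NDCG@5 is roughly half of ranking A's, which is exactly the ordering signal a coverage metric alone would miss.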
Selecting metrics #
- Define the evaluation unit: decide whether ranking quality is measured per query, per user, or per session.
- Clarify what counts as relevant: clicks, purchases, and likes each change how the metrics should be interpreted.
- Align k with the UI: match the cut-off with how many items the product actually displays.
- Consider business weighting: weight high-value items if some results matter more than others (see the sketch after this list).
- Connect offline metrics to online KPIs: validate offline improvements with A/B tests and track the correlation.
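For the business-weighting point, one common option (a sketch with made-up grades, not the only approach) is to encode item value as graded relevance so that NDCG's gain term reflects it:

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical grades: 3 = high-value conversion, 1 = click, 0 = not relevant.
graded = np.array([[3, 0, 1, 0, 0, 1, 0, 0, 0, 0]])
scores = np.array([[0.2, 0.9, 0.8, 0.1, 0.05, 0.7, 0.03, 0.02, 0.01, 0.005]])

# With binary labels the high-value item counts the same as a click;
# graded gains make burying it near the bottom of the slate much more costly.
binary = (graded > 0).astype(int)
print("NDCG@5 (binary)   =", round(ndcg_score(binary, scores, k=5), 3))
print("NDCG@5 (weighted) =", round(ndcg_score(graded, scores, k=5), 3))
```

Here the same ranking scores noticeably lower once the grade-3 item's gain is taken into account, which is the signal a value-weighted evaluation should surface.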
Quick reference #
| Category | Metric | Related pages | Notes |
|---|---|---|---|
| List-wise | NDCG / DCG | NDCG | Rewards placing high-gain items early |
| List-wise | MAP | MAP | Suitable when multiple relevant items exist |
| Top-k | Recall@k / Hit@k | Recall@k / Hit Rate | Focuses on coverage within the displayed slate |
| Top-k | Top-k Accuracy | Top-k Accuracy | Bridges classification and ranking |
| Pair-wise | AUC | ROC-AUC | Probability a positive outranks a negative |
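To illustrate the Top-k Accuracy row above, a small sketch using scikit-learn's top_k_accuracy_score with made-up class probabilities: each row of scores is treated as a ranking over classes, and the metric checks whether the true class lands in the top k.

```python
import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Hypothetical 4-class probabilities for five samples, e.g. a classifier
# reused to recommend the k most likely categories to each user.
y_true = np.array([0, 1, 2, 3, 1])
y_score = np.array(
    [
        [0.50, 0.20, 0.20, 0.10],
        [0.30, 0.40, 0.20, 0.10],
        [0.20, 0.40, 0.30, 0.10],  # true class 2 is only ranked second
        [0.70, 0.15, 0.10, 0.05],  # true class 3 is missed even at k=2
        [0.05, 0.15, 0.50, 0.30],
    ]
)

print("Top-1 accuracy:", top_k_accuracy_score(y_true, y_score, k=1))
print("Top-2 accuracy:", top_k_accuracy_score(y_true, y_score, k=2))
```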
Checklist #
- Defined what constitutes a relevant item and the evaluation scope
- Selected k values that match the UI or business surface
- Reported both coverage (Recall/Hit) and ordering (NDCG/MAP) metrics
- Compared against baselines and monitored online impact
- Ensured evaluation data reflects current popularity trends