Ranking Metrics

Summary
  • Classify ranking metrics and explain their roles in recommendation, search, and advertising systems.
  • Compare representative metrics such as NDCG, MAP, Recall@k, and Hit Rate with a worked example.
  • Outline how to pick a metric set that aligns with business KPIs and how to report improvements.

Overview of ranking evaluation #

Ranking models assign scores to items so that the “best” appear first. In recommendation and search, the quality of the top results drives engagement and revenue, so evaluation must focus on those top positions. This chapter introduces the main metric families and shows how they complement one another.


Metric categories #

1. List-wise metrics #

  • NDCG / DCG: applies a logarithmic discount to gains so that the top positions matter most (see the worked sketch after this list).
  • MAP (Mean Average Precision): averages precision at the rank of each relevant item, then takes the mean over queries; well suited when multiple hits per query are expected.
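
A quick worked sketch (made-up relevance labels, chosen only for illustration) shows how NDCG is assembled from DCG and its ideal counterpart, and checks the hand computation against scikit-learn's ndcg_score, which uses linear gains by default:

import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical graded relevance labels in the order the model ranked them.
relevance = np.array([3, 0, 2, 1, 0])

# DCG with linear gain: rel_i / log2(i + 1) for 1-based positions i.
discounts = 1.0 / np.log2(np.arange(2, len(relevance) + 2))
dcg = np.sum(relevance * discounts)

# Ideal DCG: the same sum with relevance sorted in descending order.
idcg = np.sum(np.sort(relevance)[::-1] * discounts)
print("NDCG by hand:", dcg / idcg)

# Descending dummy scores reproduce the same ordering for sklearn.
scores = -np.arange(len(relevance), dtype=float)
print("NDCG sklearn:", ndcg_score([relevance], [scores]))

The two printed values agree; the common exponential-gain variant replaces the gain rel_i with 2^rel_i - 1 and emphasises highly relevant items even more.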

2. Top-k hit metrics #

  • Recall@k: fraction of the relevant items captured within the top k.
  • Hit Rate / Hit@k: whether at least one relevant item appears in the top k.
  • Top-k Accuracy: for classifiers reused as top-k recommenders; a prediction counts as correct if the true item appears among the k highest-scoring classes (see the sketch below).
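
Recall@k and Hit@k are computed by hand in the comparison script further down; for Top-k Accuracy, scikit-learn ships top_k_accuracy_score. A minimal sketch with made-up class probabilities (four samples, three candidate items):

import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Hypothetical probabilities from a classifier reused as a recommender.
y_true = np.array([0, 1, 2, 2])  # the item each user actually chose
y_score = np.array(
    [
        [0.6, 0.3, 0.1],  # true item ranked 1st
        [0.2, 0.5, 0.3],  # true item ranked 1st
        [0.5, 0.1, 0.4],  # true item ranked 2nd
        [0.1, 0.3, 0.6],  # true item ranked 1st
    ]
)

print(top_k_accuracy_score(y_true, y_score, k=1))  # 0.75: strict accuracy
print(top_k_accuracy_score(y_true, y_score, k=2))  # 1.0: hit within the top 2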

3. Pair-wise metrics #

  • AUC (ranking): probability that a randomly chosen relevant item is scored higher than a randomly chosen irrelevant one.
  • Kendall’s τ / Spearman correlation: compare entire ordering structures rather than only the top positions (both views are sketched below).
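
Both pair-wise views can be checked directly. The sketch below (hypothetical labels and scores) estimates ROC-AUC as the fraction of positive/negative pairs the model orders correctly, and uses SciPy's kendalltau to compare two complete orderings:

import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

# Hypothetical relevance labels and model scores for one query.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.05])

# AUC as reported by scikit-learn.
auc = roc_auc_score(y_true, y_score)

# The same quantity as P(score of a positive > score of a negative),
# averaged over all positive/negative pairs (no tied scores here).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()
print(auc, pairwise)  # both about 0.867

# Kendall's tau compares two whole orderings, e.g. the scores of two
# hypothetical models ranking the same items.
other_model = np.array([0.8, 0.1, 0.75, 0.3, 0.2, 0.5, 0.4, 0.02])
tau, _ = kendalltau(y_score, other_model)
print(tau)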

Comparing ranking metrics #

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import average_precision_score, ndcg_score

# Binary relevance (y_true) and model scores (y_score) for three users
# over ten candidate items.
y_true = np.array(
    [
        [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    ]
)
y_score = np.array(
    [
        [0.9, 0.3, 0.2, 0.1, 0.6, 0.05, 0.03, 0.02, 0.01, 0.005],
        [0.4, 0.8, 0.2, 0.1, 0.05, 0.6, 0.03, 0.02, 0.01, 0.005],
        [0.2, 0.1, 0.85, 0.05, 0.03, 0.02, 0.7, 0.01, 0.005, 0.003],
    ]
)

def recall_at_k(y_true_row: np.ndarray, y_score_row: np.ndarray, k: int) -> float:
    """Fraction of the row's relevant items that appear in the top k."""
    top_k_idx = np.argsort(y_score_row)[::-1][:k]
    positives = y_true_row.sum()
    if positives == 0:
        return 0.0
    return y_true_row[top_k_idx].sum() / positives

def hit_rate_at_k(y_true_row: np.ndarray, y_score_row: np.ndarray, k: int) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    top_k_idx = np.argsort(y_score_row)[::-1][:k]
    return float(y_true_row[top_k_idx].sum() > 0)

ks = [3, 5]
ndcg5 = ndcg_score(y_true, y_score, k=5)
map_score = np.mean(
    [average_precision_score(t, s) for t, s in zip(y_true, y_score)]
)
recalls = [np.mean([recall_at_k(t, s, k) for t, s in zip(y_true, y_score)]) for k in ks]
hits = [np.mean([hit_rate_at_k(t, s, k) for t, s in zip(y_true, y_score)]) for k in ks]

metrics = {
    "NDCG@5": ndcg5,
    "MAP": map_score,
    "Recall@3": recalls[0],
    "Recall@5": recalls[1],
    "Hit@3": hits[0],
    "Hit@5": hits[1],
}

fig, ax = plt.subplots(figsize=(6.5, 3.8))
ax.bar(list(metrics.keys()), list(metrics.values()), color="#f97316", alpha=0.85)
ax.set_ylim(0, 1.05)
ax.set_ylabel("Score")
ax.set_title("Ranking metrics on sample recommendations")
ax.grid(axis="y", alpha=0.3)
plt.xticks(rotation=25)
plt.tight_layout()
plt.show()
Figure: bar chart of the ranking metrics computed above.

Each metric emphasises a different quality: Recall@k and Hit@k measure coverage of relevant items, while MAP and NDCG reward placing them early. Report at least one coverage metric and one ordering metric when assessing improvements.
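
To make that concrete, here is a small sketch (made-up labels and scores): two rankings that place the same two relevant items inside the top 5, so Recall@5 is identical, while NDCG@5 separates them because one ranking puts the relevant items first and the other buries them at positions 4 and 5.

import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

# Ranking A places the relevant items at positions 1-2,
# ranking B at positions 4-5; both keep them inside the top 5.
score_a = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]])
score_b = np.array([[0.6, 0.5, 0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]])

for name, s in [("A", score_a), ("B", score_b)]:
    top5 = np.argsort(s[0])[::-1][:5]
    recall = y_true[0][top5].sum() / y_true[0].sum()
    print(name, "Recall@5:", recall, "NDCG@5:", ndcg_score(y_true, s, k=5))

Recall@5 is 1.0 for both, but NDCG@5 drops from 1.0 for ranking A to roughly 0.5 for ranking B, which is exactly the gap a coverage-only report would hide.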


Selecting metrics #

  1. Define the evaluation unit
    Decide whether ranking quality is measured per query, user, or session.
  2. Clarify what counts as relevant
    Clicks, purchases, likes—each definition changes metric interpretation.
  3. Align k with the UI
    Match the cut-off with how many items the product actually displays.
  4. Consider business weighting
    Weight high-value items if some results matter more than others (see the sketch after this list).
  5. Connect offline metrics to online KPIs
    Validate offline improvements with A/B tests and track the correlation.
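
For step 4, one option is a value-weighted variant of Recall@k. The helper below is a hypothetical sketch (the function name and the per-item value weights are illustrative, not a standard API): instead of counting relevant items, it measures the share of total item value captured within the top k.

import numpy as np

def weighted_recall_at_k(relevant: np.ndarray, values: np.ndarray,
                         scores: np.ndarray, k: int) -> float:
    # Share of total relevant-item value (e.g. expected margin) in the top k.
    top_k = np.argsort(scores)[::-1][:k]
    captured = (relevant[top_k] * values[top_k]).sum()
    total = (relevant * values).sum()
    return float(captured / total) if total > 0 else 0.0

# Made-up example: two relevant items, one worth five times the other.
relevant = np.array([1, 0, 1, 0, 0])
values = np.array([5.0, 1.0, 1.0, 1.0, 1.0])
scores = np.array([0.2, 0.9, 0.8, 0.4, 0.1])

print(weighted_recall_at_k(relevant, values, scores, k=2))

Plain Recall@2 here is 0.5, but the weighted version is about 0.17 because the single high-value item was missed; that gap is what a business-weighted metric is designed to expose.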

Quick reference #

Category  | Metric            | Related pages       | Notes
List-wise | NDCG / DCG        | NDCG                | Rewards placing high-gain items early
List-wise | MAP               | MAP                 | Suitable when multiple relevant items exist
Top-k     | Recall@k / Hit@k  | Recall@k / Hit Rate | Focuses on coverage within the displayed slate
Top-k     | Top-k Accuracy    | Top-k Accuracy      | Bridges classification and ranking
Pair-wise | AUC               | ROC-AUC             | Probability a positive outranks a negative

Checklist #

  • Defined what constitutes a relevant item and the evaluation scope
  • Selected k values that match the UI or business surface
  • Reported both coverage (Recall/Hit) and ordering (NDCG/MAP) metrics
  • Compared against baselines and monitored online impact
  • Ensured evaluation data reflects current popularity trends