Ranking Metrics

Summary
  • Classify ranking metrics and explain their roles in recommendation, search, and advertising systems.
  • Compare representative metrics such as NDCG, MAP, Recall@k, and Hit Rate with a worked example.
  • Outline how to pick a metric set that aligns with business KPIs and how to report improvements.

Overview of ranking evaluation #

Ranking models assign scores to items so that the “best” appear first. In recommendation and search, the quality of the top results drives engagement and revenue, so evaluation must focus on those top positions. This chapter introduces the main metric families and shows how they complement one another.


Metric categories #

1. List-wise metrics #

  • NDCG / DCG: applies a logarithmic discount to gains so that the top positions matter most (see the worked sketch after this list).
  • MAP (Mean Average Precision): averages precision at the rank of each relevant item, then takes the mean over queries; well suited when multiple hits per query are expected.
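
A quick worked sketch (made-up relevance labels, chosen only for illustration) shows how NDCG is assembled from DCG and its ideal counterpart, and checks the hand computation against scikit-learn's ndcg_score, which uses linear gains by default:

import numpy as np
from sklearn.metrics import ndcg_score

# Hypothetical graded relevance labels in the order the model ranked them.
relevance = np.array([3, 0, 2, 1, 0])

# DCG with linear gain: rel_i / log2(i + 1) for 1-based positions i.
discounts = 1.0 / np.log2(np.arange(2, len(relevance) + 2))
dcg = np.sum(relevance * discounts)

# Ideal DCG: the same sum with relevance sorted in descending order.
idcg = np.sum(np.sort(relevance)[::-1] * discounts)
print("NDCG by hand:", dcg / idcg)

# Descending dummy scores reproduce the same ordering for sklearn.
scores = -np.arange(len(relevance), dtype=float)
print("NDCG sklearn:", ndcg_score([relevance], [scores]))

The two printed values agree; the common exponential-gain variant replaces the gain rel_i with 2^rel_i - 1 and emphasises highly relevant items even more.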

2. Top-k hit metrics #

  • Recall@k: fraction of the relevant items captured within the top k.
  • Hit Rate / Hit@k: whether at least one relevant item appears in the top k.
  • Top-k Accuracy: for classifiers reused as top-k recommenders; a prediction counts as correct if the true item appears among the k highest-scoring classes (see the sketch below).
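
Recall@k and Hit@k are computed by hand in the comparison script further down; for Top-k Accuracy, scikit-learn ships top_k_accuracy_score. A minimal sketch with made-up class probabilities (four samples, three candidate items):

import numpy as np
from sklearn.metrics import top_k_accuracy_score

# Hypothetical probabilities from a classifier reused as a recommender.
y_true = np.array([0, 1, 2, 2])  # the item each user actually chose
y_score = np.array(
    [
        [0.6, 0.3, 0.1],  # true item ranked 1st
        [0.2, 0.5, 0.3],  # true item ranked 1st
        [0.5, 0.1, 0.4],  # true item ranked 2nd
        [0.1, 0.3, 0.6],  # true item ranked 1st
    ]
)

print(top_k_accuracy_score(y_true, y_score, k=1))  # 0.75: strict accuracy
print(top_k_accuracy_score(y_true, y_score, k=2))  # 1.0: hit within the top 2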

3. Pair-wise metrics #

  • AUC (ranking): probability that a randomly chosen relevant item is scored higher than a randomly chosen irrelevant one.
  • Kendall’s τ / Spearman correlation: compare entire ordering structures rather than only the top positions (both views are sketched below).
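
Both pair-wise views can be checked directly. The sketch below (hypothetical labels and scores) estimates ROC-AUC as the fraction of positive/negative pairs the model orders correctly, and uses SciPy's kendalltau to compare two complete orderings:

import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import roc_auc_score

# Hypothetical relevance labels and model scores for one query.
y_true = np.array([1, 0, 1, 0, 0, 1, 0, 0])
y_score = np.array([0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.05])

# AUC as reported by scikit-learn.
auc = roc_auc_score(y_true, y_score)

# The same quantity as P(score of a positive > score of a negative),
# averaged over all positive/negative pairs (no tied scores here).
pos = y_score[y_true == 1]
neg = y_score[y_true == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()
print(auc, pairwise)  # both about 0.867

# Kendall's tau compares two whole orderings, e.g. the scores of two
# hypothetical models ranking the same items.
other_model = np.array([0.8, 0.1, 0.75, 0.3, 0.2, 0.5, 0.4, 0.02])
tau, _ = kendalltau(y_score, other_model)
print(tau)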

Comparing ranking metrics #

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import average_precision_score, ndcg_score

# Binary relevance (y_true) and model scores (y_score) for three users
# over ten candidate items.
y_true = np.array(
    [
        [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
    ]
)
y_score = np.array(
    [
        [0.9, 0.3, 0.2, 0.1, 0.6, 0.05, 0.03, 0.02, 0.01, 0.005],
        [0.4, 0.8, 0.2, 0.1, 0.05, 0.6, 0.03, 0.02, 0.01, 0.005],
        [0.2, 0.1, 0.85, 0.05, 0.03, 0.02, 0.7, 0.01, 0.005, 0.003],
    ]
)

def recall_at_k(y_true_row: np.ndarray, y_score_row: np.ndarray, k: int) -> float:
    """Fraction of the row's relevant items that appear in the top k."""
    top_k_idx = np.argsort(y_score_row)[::-1][:k]
    positives = y_true_row.sum()
    if positives == 0:
        return 0.0
    return y_true_row[top_k_idx].sum() / positives

def hit_rate_at_k(y_true_row: np.ndarray, y_score_row: np.ndarray, k: int) -> float:
    """1.0 if at least one relevant item appears in the top k, else 0.0."""
    top_k_idx = np.argsort(y_score_row)[::-1][:k]
    return float(y_true_row[top_k_idx].sum() > 0)

ks = [3, 5]
ndcg5 = ndcg_score(y_true, y_score, k=5)
map_score = np.mean(
    [average_precision_score(t, s) for t, s in zip(y_true, y_score)]
)
recalls = [np.mean([recall_at_k(t, s, k) for t, s in zip(y_true, y_score)]) for k in ks]
hits = [np.mean([hit_rate_at_k(t, s, k) for t, s in zip(y_true, y_score)]) for k in ks]

metrics = {
    "NDCG@5": ndcg5,
    "MAP": map_score,
    "Recall@3": recalls[0],
    "Recall@5": recalls[1],
    "Hit@3": hits[0],
    "Hit@5": hits[1],
}

fig, ax = plt.subplots(figsize=(6.5, 3.8))
ax.bar(list(metrics.keys()), list(metrics.values()), color="#f97316", alpha=0.85)
ax.set_ylim(0, 1.05)
ax.set_ylabel("Score")
ax.set_title("Ranking metrics on sample recommendations")
ax.grid(axis="y", alpha=0.3)
plt.xticks(rotation=25)
plt.tight_layout()
plt.show()
Figure: bar chart of the ranking metrics computed above.

Each metric emphasises a different quality: Recall@k and Hit@k measure coverage of relevant items, while MAP and NDCG reward placing them early. Report at least one coverage metric and one ordering metric when assessing improvements.
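
To make that concrete, here is a small sketch (made-up labels and scores): two rankings that place the same two relevant items inside the top 5, so Recall@5 is identical, while NDCG@5 separates them because one ranking puts the relevant items first and the other buries them at positions 4 and 5.

import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[1, 1, 0, 0, 0, 0, 0, 0, 0, 0]])

# Ranking A places the relevant items at positions 1-2,
# ranking B at positions 4-5; both keep them inside the top 5.
score_a = np.array([[0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]])
score_b = np.array([[0.6, 0.5, 0.9, 0.8, 0.7, 0.4, 0.3, 0.2, 0.1, 0.05]])

for name, s in [("A", score_a), ("B", score_b)]:
    top5 = np.argsort(s[0])[::-1][:5]
    recall = y_true[0][top5].sum() / y_true[0].sum()
    print(name, "Recall@5:", recall, "NDCG@5:", ndcg_score(y_true, s, k=5))

Recall@5 is 1.0 for both, but NDCG@5 drops from 1.0 for ranking A to roughly 0.5 for ranking B, which is exactly the gap a coverage-only report would hide.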


Selecting metrics #

  1. Define the evaluation unit
    Decide whether ranking quality is measured per query, user, or session.
  2. Clarify what counts as relevant
    Clicks, purchases, likes—each definition changes metric interpretation.
  3. Align k with the UI
    Match the cut-off with how many items the product actually displays.
  4. Consider business weighting
    Weight high-value items if some results matter more than others (see the sketch after this list).
  5. Connect offline metrics to online KPIs
    Validate offline improvements with A/B tests and track the correlation.
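
For step 4, one option is a value-weighted variant of Recall@k. The helper below is a hypothetical sketch (the function name and the per-item value weights are illustrative, not a standard API): instead of counting relevant items, it measures the share of total item value captured within the top k.

import numpy as np

def weighted_recall_at_k(relevant: np.ndarray, values: np.ndarray,
                         scores: np.ndarray, k: int) -> float:
    # Share of total relevant-item value (e.g. expected margin) in the top k.
    top_k = np.argsort(scores)[::-1][:k]
    captured = (relevant[top_k] * values[top_k]).sum()
    total = (relevant * values).sum()
    return float(captured / total) if total > 0 else 0.0

# Made-up example: two relevant items, one worth five times the other.
relevant = np.array([1, 0, 1, 0, 0])
values = np.array([5.0, 1.0, 1.0, 1.0, 1.0])
scores = np.array([0.2, 0.9, 0.8, 0.4, 0.1])

print(weighted_recall_at_k(relevant, values, scores, k=2))

Plain Recall@2 here is 0.5, but the weighted version is about 0.17 because the single high-value item was missed; that gap is what a business-weighted metric is designed to expose.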

Quick reference #

Category  | Metric            | Related pages       | Notes
List-wise | NDCG / DCG        | NDCG                | Rewards placing high-gain items early
List-wise | MAP               | MAP                 | Suitable when multiple relevant items exist
Top-k     | Recall@k / Hit@k  | Recall@k / Hit Rate | Focuses on coverage within the displayed slate
Top-k     | Top-k Accuracy    | Top-k Accuracy      | Bridges classification and ranking
Pair-wise | AUC               | ROC-AUC             | Probability a positive outranks a negative

Checklist #

  • Defined what constitutes a relevant item and the evaluation scope
  • Selected k values that match the UI or business surface
  • Reported both coverage (Recall/Hit) and ordering (NDCG/MAP) metrics
  • Compared against baselines and monitored online impact
  • Ensured evaluation data reflects current popularity trends