NDCG (Normalized Discounted Cumulative Gain)

Summary
  • NDCG is a ranking metric that scores a result list by its discounted cumulative gain, normalized by the gain of the ideal ordering.
  • This page computes DCG and NDCG from relevance scores and shows how the logarithmic discount works.
  • We also review considerations for graded (multi-level) relevance and multi-stage (cascade) ranking systems.

1. Definition

Given a relevance score \(rel_i\) for the item at rank \(i\), the Discounted Cumulative Gain (DCG) over the top \(k\) results is defined as:

$$ \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} $$

To normalize this, we compute the DCG of the ideal ranking, in which items are sorted by descending relevance (IDCG), and obtain a score between 0 and 1:

$$ \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $$
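
The two formulas above can be sketched directly in plain Python. This is a minimal illustration, not a reference implementation: the function names `dcg_at_k` and `ndcg_at_k`, the example labels, and the choice to return 0.0 when no item is relevant are all assumptions made for this example.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with exponential gain (2^rel - 1) and a log2(i + 1) discount."""
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = sorted(relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    # Convention assumed here: return 0.0 when no item is relevant.
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Hypothetical relevance labels, listed in the order the system ranked them.
ranked_relevances = [3, 2, 3, 0, 1, 2]
print("DCG@5: ", round(dcg_at_k(ranked_relevances, 5), 4))
print("NDCG@5:", round(ndcg_at_k(ranked_relevances, 5), 4))
```

A list that is already sorted by descending relevance scores exactly 1.0, and any other ordering of the same labels scores at most 1.0.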


2. Computing in Python

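import numpy as np
from sklearn.metrics import ndcg_score

# y_true: true relevance scores with shape (n_samples, n_labels); one row per query
# y_score: model output scores with the same shape
# The values below are illustrative only.
y_true = np.array([[3, 2, 3, 0, 1, 2]])
y_score = np.array([[0.9, 0.8, 0.1, 0.4, 0.3, 0.7]])

# Rank items by y_score, then score the top 10 against y_true
score = ndcg_score(y_true, y_score, k=10)

print("NDCG@10:", round(score, 4))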

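ndcg_score accepts binary relevance (0/1) as well as graded, non-negative scores. Note that it uses the relevance value itself as the gain, so for graded labels its result differs from the \(2^{rel_i} - 1\) formula in Section 1; for binary labels the two agree. The key is to define the ground-truth relevance matrix correctly: one row per query and one column per candidate item.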
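
To illustrate how much the gain function matters for graded labels, the sketch below compares a linear gain (\(rel_i\), the form scikit-learn's ndcg_score applies) with the exponential gain \(2^{rel_i} - 1\) from Section 1. The helper `ndcg`, the two gain functions, and the label lists are hypothetical names and values chosen for this example. With scikit-learn, one way to obtain the exponential variant is to pass `2 ** y_true - 1` as the ground truth.

```python
import math

def ndcg(relevances, k, gain):
    """NDCG@k for labels given in ranked order, with a pluggable gain function."""
    def dcg(rels):
        return sum(
            gain(rel) / math.log2(i + 1)
            for i, rel in enumerate(rels[:k], start=1)
        )
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

def linear(rel):
    return rel            # gain = rel

def exponential(rel):
    return 2 ** rel - 1   # gain = 2^rel - 1

# Hypothetical labels in ranked order: the best item (grade 3) sits at rank 2.
graded = [1, 3, 2, 0]
binary = [0, 1, 1, 0]

print("graded, linear gain:     ", round(ndcg(graded, 4, linear), 4))
print("graded, exponential gain:", round(ndcg(graded, 4, exponential), 4))
print("binary, linear gain:     ", round(ndcg(binary, 4, linear), 4))
print("binary, exponential gain:", round(ndcg(binary, 4, exponential), 4))
```

The exponential gain penalizes the misplaced grade-3 item more heavily than the linear gain does, while the binary list gets the same score under both because \(2^0 - 1 = 0\) and \(2^1 - 1 = 1\).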


3. Hyperparameters

  • k (cutoff): Choose @5, @10, etc., based on how many results are shown to users.
  • Relevance scale: Binary scores (0/1) work, but graded relevance levels (e.g., highly relevant, somewhat relevant) yield more nuanced evaluation.
  • Log base: Base 2 is standard. Changing the base multiplies every DCG value by the same constant, which cancels in the NDCG ratio, so NDCG values and the relative ranking of systems are unchanged.
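
The effect of the cutoff and of the log base can be checked with a small sketch. The function `ndcg_at_k`, its `base` parameter, and the example labels are assumptions made for illustration; the gain follows the \(2^{rel_i} - 1\) definition in Section 1.

```python
import math

def ndcg_at_k(relevances, k, base=2.0):
    """NDCG@k with exponential gain; `base` sets the logarithm of the discount."""
    def dcg(rels):
        return sum(
            (2 ** rel - 1) / math.log(i + 1, base)
            for i, rel in enumerate(rels[:k], start=1)
        )
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

# Hypothetical labels in ranked order: the two top slots hold irrelevant items.
ranked = [0, 0, 3, 2, 1, 0]

# The same ranking looks very different depending on the cutoff.
for k in (2, 5):
    print(f"NDCG@{k}:", round(ndcg_at_k(ranked, k), 4))

# Different log bases: the constant factor cancels in the DCG / IDCG ratio.
same = math.isclose(ndcg_at_k(ranked, 5, base=2.0), ndcg_at_k(ranked, 5, base=10.0))
print("base 2 and base 10 agree:", same)
```

Here NDCG@2 is zero because nothing relevant appears in the first two positions, whereas NDCG@5 gives partial credit for the relevant items at ranks 3 to 5. This is why the cutoff should match how many results users actually see.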

4. Practical Applications

  • Search evaluation: Commonly used to measure how well search results align with human-annotated relevance labels.
  • Recommendation systems: Treat implicit feedback (views, clicks, purchases) as relevance signals to track ranking improvements.
  • A/B testing: Combine NDCG with online metrics to understand how offline improvements translate to real-world performance.
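
For the recommendation case, one possible way to turn implicit feedback into graded relevance is sketched below. The event-to-grade mapping (`EVENT_GRADE`), the function names, and the sample data are all hypothetical; a real system would choose grades that fit its own product and logging.

```python
import math

# Hypothetical grades: the strongest event observed for an item sets its relevance.
EVENT_GRADE = {"view": 1, "click": 2, "purchase": 3}

def ndcg_at_k(relevances, k):
    """NDCG@k with exponential gain and a log2(i + 1) discount."""
    def dcg(rels):
        return sum(
            (2 ** rel - 1) / math.log2(i + 1)
            for i, rel in enumerate(rels[:k], start=1)
        )
    idcg = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / idcg if idcg > 0 else 0.0

def ndcg_from_events(ranked_items, strongest_event, k):
    """Map each recommended item to a grade (0 if no event was logged), then score."""
    relevances = [EVENT_GRADE.get(strongest_event.get(item), 0) for item in ranked_items]
    return ndcg_at_k(relevances, k)

# Hypothetical recommendation list and logged feedback for one user.
ranked_items = ["a", "b", "c", "d"]
strongest_event = {"b": "purchase", "a": "view", "d": "click"}

print("NDCG@4:", round(ndcg_from_events(ranked_items, strongest_event, 4), 4))
```

In this sketch the ideal ordering is built only from the items that were recommended, so an item the user would have bought but never saw does not lower the score. That limitation is one reason to read offline NDCG together with online metrics.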

5. Key Considerations

  • Ground-truth labeling is expensive; implicit signals can be noisy.
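  • In two-stage systems (candidate generation → ranking), use a metric suited to each phase: for example, Recall@k to check that candidate generation retrieves the relevant items, and NDCG to check that the ranker orders them well.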
  • Combine NDCG with Recall@k, MAP, or other metrics to capture a holistic view of user experience.
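
As a rough sketch of reporting several metrics side by side, the block below computes Recall@k, average precision (the per-query quantity that MAP averages over queries), and NDCG@k for one query with binary labels. The function names and the sample list are hypothetical, and the sketch assumes every relevant item appears somewhere in the ranked list.

```python
import math

def recall_at_k(rels, k):
    """Share of all relevant items that appear in the top k."""
    total = sum(rels)
    return sum(rels[:k]) / total if total else 0.0

def average_precision(rels):
    """Mean of precision@i taken at each rank i that holds a relevant item."""
    hits, precisions = 0, []
    for i, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(precisions) if precisions else 0.0

def ndcg_at_k(rels, k):
    """NDCG@k with exponential gain and a log2(i + 1) discount."""
    def dcg(xs):
        return sum((2 ** x - 1) / math.log2(i + 1) for i, x in enumerate(xs[:k], start=1))
    idcg = dcg(sorted(rels, reverse=True))
    return dcg(rels) / idcg if idcg > 0 else 0.0

# Hypothetical binary labels (1 = relevant) in ranked order for a single query.
rels = [1, 0, 1, 0, 0, 1]

print("Recall@3:", round(recall_at_k(rels, 3), 4))
print("AP:      ", round(average_precision(rels), 4))
print("NDCG@3:  ", round(ndcg_at_k(rels, 3), 4))
```

Recall@k ignores order within the top k, average precision rewards early hits but has no notion of graded relevance, and NDCG covers both order and grade, so the three can disagree on the same ranking.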

Summary

  • NDCG measures whether highly relevant items appear near the top of a ranking, discounting each item's gain logarithmically by its position.
  • It’s easy to compute using ndcg_score, but the choice of k and relevance scale greatly affects interpretation.
  • Use NDCG alongside other ranking metrics to comprehensively assess ranking quality.