Summary
- NDCG (Normalized Discounted Cumulative Gain) is a ranking metric that scores a result list by its discounted cumulative gain, normalized by the best achievable value.
- This page computes DCG/NDCG from relevance scores and shows how logarithmic discounting works.
- It also reviews considerations for multi-level relevance and multi-stage (cascade) systems.
1. Definition
Given a relevance score \(rel_i\) for the item at rank \(i\), the Discounted Cumulative Gain (DCG) over the top \(k\) results is defined as:
$$ \mathrm{DCG@k} = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i + 1)} $$
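To normalize this, we compute the DCG of the ideal ranking, in which items are sorted by descending relevance (IDCG), and divide by it, so that NDCG@k lies between 0 and 1 for non-negative relevance scores: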
$$ \mathrm{NDCG@k} = \frac{\mathrm{DCG@k}}{\mathrm{IDCG@k}} $$
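The two formulas above can be written out directly. The following is a minimal sketch in plain Python, not a reference implementation: the function names `dcg_at_k` and `ndcg_at_k`, the example labels, and the convention of returning 0.0 when no item is relevant are all assumptions made for illustration.

```python
import math

def dcg_at_k(relevances, k):
    """DCG@k with exponential gain: sum of (2^rel - 1) / log2(rank + 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the given order divided by DCG of the ideal order."""
    ideal = sorted(relevances, reverse=True)  # best possible ordering
    idcg = dcg_at_k(ideal, k)
    # Assumed convention: return 0.0 when no item is relevant (IDCG is 0).
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# Relevance labels listed in the order the system ranked the items (made-up values).
ranked_relevances = [3, 2, 0, 1]
print("NDCG@4:", round(ndcg_at_k(ranked_relevances, k=4), 4))
```

Swapping the last two items would give the ideal order and an NDCG of exactly 1. The small gap here reflects a single low-grade item placed one rank too low.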
2. Computing in Python
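```python
import numpy as np
from sklearn.metrics import ndcg_score

# y_true: true relevance scores, shape (n_samples, n_labels); values are illustrative
y_true = np.array([[3, 2, 3, 0, 1, 2]])
# y_score: model output scores, same shape as y_true; values are illustrative
y_score = np.array([[0.9, 0.8, 0.1, 0.4, 0.3, 0.7]])

score = ndcg_score(y_true, y_score, k=10)
print("NDCG@10:", round(score, 4))
```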
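ndcg_score accepts graded relevance scores as well as binary relevance (0/1). Note that it uses the relevance value itself as the gain (\(rel_i\) rather than \(2^{rel_i} - 1\)), so with graded labels its result differs from the definition above; with binary labels the two agree.
The key is to properly define your ground-truth relevance matrix.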
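One detail worth checking when using graded labels: scikit-learn's ndcg_score takes the relevance value itself as the gain, while the definition in section 1 uses \(2^{rel_i} - 1\). The sketch below shows one possible way to reproduce the exponential-gain variant by transforming the labels before the call. The label and score values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import ndcg_score

y_true = np.array([[3, 2, 0, 1]])           # graded labels (made-up values)
y_score = np.array([[0.2, 0.9, 0.5, 0.4]])  # model scores (made-up values)

# Linear gain: what ndcg_score computes from the raw labels.
linear = ndcg_score(y_true, y_score)

# Exponential gain: pass 2^rel - 1 as the "relevance" so the gain matches section 1.
exponential = ndcg_score(2 ** y_true - 1, y_score)

print("linear gain:     ", round(linear, 4))
print("exponential gain:", round(exponential, 4))
```

Here the most relevant item is ranked last, and the exponential gain penalizes that mistake more heavily than the linear gain does.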
3. Hyperparameters
- k (cutoff): Choose @5, @10, etc., based on how many results are shown to users.
- Relevance scale: Binary scores (0/1) work, but graded relevance levels (e.g., highly relevant, somewhat relevant) yield more nuanced evaluation.
- Log base: Log base 2 is standard. With the formula above, a different base multiplies DCG and IDCG by the same constant, so DCG values are rescaled while NDCG itself is unchanged.
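The effect of the cutoff and the log base can be checked numerically. This is a small sketch under the definition from section 1; the helper names `dcg` and `ndcg`, the `base` parameter, and the example labels are assumptions for illustration only.

```python
import math

def dcg(relevances, k, base=2.0):
    """DCG@k with exponential gain and a configurable log base."""
    return sum(
        (2 ** rel - 1) / math.log(rank + 1, base)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg(relevances, k, base=2.0):
    """NDCG@k; returns 0.0 when no item is relevant (assumed convention)."""
    idcg = dcg(sorted(relevances, reverse=True), k, base)
    return dcg(relevances, k, base) / idcg if idcg > 0 else 0.0

ranked = [1, 0, 0, 3, 2]  # made-up labels in ranked order

# The same ranking looks very different depending on the cutoff.
for k in (1, 3, 5):
    print(f"NDCG@{k}:", round(ndcg(ranked, k), 4))

# A different base rescales DCG and IDCG by the same factor, so NDCG matches.
same = math.isclose(ndcg(ranked, 5, base=2.0), ndcg(ranked, 5, base=10.0))
print("base 2 vs base 10 give the same NDCG:", same)
```

In this example the highly relevant items sit at ranks 4 and 5, so they are invisible to NDCG@1 and NDCG@3 and only contribute at k=5.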
4. Practical Applications
- Search evaluation: Commonly used to measure how well search results align with human-annotated relevance labels.
- Recommendation systems: Treat implicit feedback (views, clicks, purchases) as relevance signals to track ranking improvements.
- A/B testing: Combine NDCG with online metrics to understand how offline improvements translate to real-world performance.
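For the recommendation case, implicit feedback has to be turned into relevance grades before NDCG can be computed. The sketch below shows one hypothetical mapping. The grades assigned to each event type, the `EVENT_TO_RELEVANCE` table, and the `relevance_from_events` helper are assumptions, not a standard; the right mapping depends on the product.

```python
# Hypothetical mapping from an interaction type to a relevance grade.
EVENT_TO_RELEVANCE = {"purchase": 3, "click": 2, "view": 1}

def relevance_from_events(events):
    """Return the highest grade among the events for one item (0 if none)."""
    return max((EVENT_TO_RELEVANCE.get(event, 0) for event in events), default=0)

# Events observed per recommended item, in ranked order (made-up log data).
ranked_item_events = [["view"], ["view", "click", "purchase"], [], ["view", "click"]]

ranked_relevances = [relevance_from_events(events) for events in ranked_item_events]
print(ranked_relevances)  # [1, 3, 0, 2]
```

The resulting list can be used as the graded relevance input to any DCG/NDCG computation.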
5. Key Considerations
- Ground-truth labeling is expensive; implicit signals can be noisy.
- In two-stage systems (candidate generation → ranking), use a metric suited to each stage, for example Recall@k for candidate generation and NDCG@k for the final ranking.
- Combine NDCG with Recall@k, MAP, or other metrics to capture a holistic view of user experience.
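To report companion metrics next to NDCG, Recall@k and average precision (the per-query quantity that MAP averages over queries) can be computed from the same ranked label list. This is a minimal sketch that treats any label above 0 as relevant and assumes the list contains every relevant item for the query. The function names and data are illustrative.

```python
def recall_at_k(relevances, k):
    """Share of all relevant items (rel > 0) that appear in the top k."""
    total_relevant = sum(1 for rel in relevances if rel > 0)
    if total_relevant == 0:
        return 0.0
    hits = sum(1 for rel in relevances[:k] if rel > 0)
    return hits / total_relevant

def average_precision(relevances):
    """Mean of precision@i over the ranks i that hold a relevant item."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranked = [0, 3, 0, 1, 2]  # made-up labels in ranked order

print("Recall@3:", round(recall_at_k(ranked, 3), 4))
print("AP:", round(average_precision(ranked), 4))
```

Unlike NDCG, both metrics ignore the grade itself, which is one reason to look at them together rather than in isolation.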
Summary
- NDCG measures how close to the top the highly relevant items are placed, discounting each item's gain logarithmically by rank.
- It’s easy to compute using ndcg_score, but the choice of k and relevance scale greatly affects interpretation.
- Use NDCG alongside other ranking metrics to comprehensively assess ranking quality.