4.5.2
MAP (Mean Average Precision)
Summary
- MAP is a ranking metric that averages the Average Precision (AP) across multiple queries.
- Using a search example, we compute AP and MAP to observe how ranking positions affect the score.
- We also cover practical issues like long candidate lists and weighting methods.
1. Definition of AP and MAP #
For a single query, the Average Precision (AP) averages the precision at each rank where a correct item is found.
$$ \mathrm{AP} = \frac{1}{|G|} \sum_{k \in G} P(k) $$Here, \(G\) is the set of ranks of relevant items, and \(P(k)\) is the precision up to position \(k\).
MAP is then the mean of AP values across all queries.
2. Computing in Python #
Although Scikit-learn doesn’t provide a direct MAP function, we can compute average_precision_score per query and then take the mean.
| |
y_true[q] represents binary labels (0/1) for the query, and y_score[q] represents model output scores.
3. Characteristics and Advantages #
- Works well for rankings with multiple relevant items.
- Rewards systems that find correct items earlier in the list.
- Reflects both precision and recall, offering a more comprehensive evaluation than simple Precision@k.
4. Practical Applications #
- Search systems: Evaluate how well the results cover all relevant items.
- Recommendation systems: Ideal when multiple relevant outputs exist (e.g., products viewed or purchased).
- Learning to Rank (LTR): Commonly used for offline evaluation in ranking models like LambdaMART or XGBoost.
5. Points to Note #
- If the number of relevant items per query varies greatly, MAP can be biased — consider Weighted MAP.
- Queries with no relevant items (only 0s) yield undefined AP; decide whether to exclude or assign zero.
- Use together with NDCG, Recall@k, and other metrics for a holistic ranking evaluation.
Summary #
- MAP is the average of average precisions — ideal for rankings with multiple correct answers.
- It’s simple to compute by averaging AP values per query.
- Combine with NDCG or Recall@k to improve ranking system performance from multiple perspectives.