4.5.2
MAP (Mean Average Precision) | Ranking evaluation across multiple queries
- MAP is a ranking metric that averages the Average Precision (AP) across multiple queries.
- Using a search example, we compute AP and MAP to observe how ranking positions affect the score.
- We also cover practical issues like long candidate lists and weighting methods.
1. Definition of AP and MAP #
For a single query, the Average Precision (AP) averages the precision at each rank where a correct item is found.
$$ \mathrm{AP} = \frac{1}{|G|} \sum_{k \in G} P(k) $$
Here, \(G\) is the set of ranks of relevant items, and \(P(k)\) is the precision up to position \(k\).
MAP is then the mean of AP values across all queries.
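To make the formula concrete, here is a minimal from-scratch sketch of AP for a single query. The input is a list of 0/1 relevance labels already sorted by the model's ranking; the function name is illustrative:

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of P(k) at each rank k where a relevant item appears."""
    hits = 0
    precisions = []
    for k, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1                    # relevant items found so far
            precisions.append(hits / k)  # P(k) = hits / position
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant items at ranks 1 and 3: AP = (1/1 + 2/3) / 2 = 5/6
print(average_precision([1, 0, 1, 0, 0]))
```

Averaging this value over all queries gives MAP.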
2. Computing in Python #
Although scikit-learn doesn’t provide a direct MAP function, we can compute `average_precision_score` per query and then take the mean.
Here, `y_true[q]` holds the binary labels (0/1) for query `q`, and `y_score[q]` holds the model’s output scores.
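The per-query approach described above can be sketched as follows (query names, labels, and scores are illustrative):

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Toy data: two queries with binary relevance labels and model scores.
y_true = {
    "q1": np.array([1, 0, 1, 0, 0]),
    "q2": np.array([0, 1, 0, 0, 1]),
}
y_score = {
    "q1": np.array([0.9, 0.8, 0.7, 0.4, 0.2]),
    "q2": np.array([0.6, 0.9, 0.3, 0.2, 0.8]),
}

# AP per query, then the mean across queries gives MAP.
ap_per_query = {q: average_precision_score(y_true[q], y_score[q]) for q in y_true}
map_score = np.mean(list(ap_per_query.values()))
print(ap_per_query, map_score)
```

For `q1`, the relevant items land at ranks 1 and 3, so AP = (1/1 + 2/3)/2 ≈ 0.83; for `q2`, both relevant items rank first and second, so AP = 1.0, giving MAP ≈ 0.92.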
3. Characteristics and Advantages #
- Works well for rankings with multiple relevant items.
- Rewards systems that find correct items earlier in the list.
- Reflects both precision and recall, offering a more comprehensive evaluation than simple Precision@k.
4. Practical Applications #
- Search systems: Evaluate how well the results cover all relevant items.
- Recommendation systems: Ideal when multiple relevant outputs exist (e.g., products viewed or purchased).
- Learning to Rank (LTR): Commonly used for offline evaluation of ranking models such as LambdaMART (available in libraries like XGBoost and LightGBM).
5. Points to Note #
- If the number of relevant items per query varies greatly, MAP can be biased — consider Weighted MAP.
- Queries with no relevant items (only 0s) yield undefined AP; decide whether to exclude or assign zero.
- Use together with NDCG, Recall@k, and other metrics for a holistic ranking evaluation.
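The first two caveats above can be folded into a small wrapper; `mean_average_precision`, its `weights` argument, and the `skip_empty` flag are hypothetical names used only for illustration:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true_list, y_score_list, weights=None, skip_empty=True):
    """MAP with optional per-query weights (hypothetical helper).
    Queries with no positive labels are excluded by default, since AP is
    undefined there; pass skip_empty=False to count them as AP = 0 instead.
    Assumes at least one query contributes an AP value."""
    aps, ws = [], []
    for i, (yt, ys) in enumerate(zip(y_true_list, y_score_list)):
        yt = np.asarray(yt)
        if yt.sum() == 0:
            if skip_empty:
                continue        # drop the query entirely
            aps.append(0.0)     # or count it as a total miss
        else:
            aps.append(average_precision_score(yt, ys))
        ws.append(1.0 if weights is None else weights[i])
    return np.average(aps, weights=ws)
```

For example, with one normal query and one all-negative query, the default excludes the empty query, while `skip_empty=False` pulls the score down by averaging in a zero.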
Summary #
- MAP is the average of average precisions — ideal for rankings with multiple correct answers.
- It’s simple to compute by averaging AP values per query.
- Combine with NDCG or Recall@k to improve ranking system performance from multiple perspectives.
FAQ #
What is Mean Average Precision (MAP)? #
Mean Average Precision (MAP) is a ranking evaluation metric that measures how well a system ranks all relevant items across multiple queries. For each query, it computes Average Precision (AP) — the average of precision values at every position where a relevant item is found. MAP is then the mean of AP across all queries.
A MAP of 1.0 means every relevant item is ranked above every irrelevant item for every query; a MAP near 0 means relevant items sit at the bottom of every list.
What is the formula for MAP? #
For a single query with relevant items at ranks \(G\):
$$ \mathrm{AP} = \frac{1}{|G|} \sum_{k \in G} P(k) $$
Averaged over \(Q\) queries:
$$ \mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^Q \mathrm{AP}_q $$
where \(P(k)\) is the precision at rank \(k\). MAP rewards finding relevant items early (at top positions) more than finding them late.
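As a worked example of the formulas above: suppose a query's relevant items sit at ranks 1 and 3, so \(G = \{1, 3\}\):
$$ \mathrm{AP} = \frac{1}{2}\left(\frac{1}{1} + \frac{2}{3}\right) = \frac{5}{6} \approx 0.83 $$
Had the same two items sat at ranks 4 and 5 instead, AP would be \(\frac{1}{2}\left(\frac{1}{4} + \frac{2}{5}\right) \approx 0.33\), illustrating the reward for early discovery.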
How is MAP different from Precision@k? #
Precision@k measures the fraction of relevant items in the top-k results — it ignores the order within that top-k and ignores items ranked below k. MAP considers the exact rank of every relevant item and penalises late discoveries, giving a more complete picture of ranking quality.
Use Precision@k when users only see the top k results. Use MAP when you care about retrieving all relevant items in good order.
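The difference is easy to see numerically; a small sketch (helper names are illustrative):

```python
def precision_at_k(ranked, k):
    """Fraction of relevant items among the top-k of a ranked 0/1 list."""
    return sum(ranked[:k]) / k

def average_precision(ranked):
    """Mean of P(k) over every rank k holding a relevant item."""
    hits, precs = 0, []
    for k, rel in enumerate(ranked, start=1):
        if rel:
            hits += 1
            precs.append(hits / k)
    return sum(precs) / len(precs) if precs else 0.0

early = [1, 1, 0, 0, 0]  # both relevant items found immediately
late  = [0, 0, 0, 1, 1]  # same items, found only at the bottom

# Precision@5 cannot tell the two rankings apart ...
print(precision_at_k(early, 5), precision_at_k(late, 5))  # 0.4 0.4
# ... but AP strongly prefers the early discoveries.
print(average_precision(early), average_precision(late))
```

Both rankings score Precision@5 = 0.4, yet AP gives 1.0 to the first and about 0.33 to the second.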
When should I use MAP vs NDCG? #
| | MAP | NDCG |
|---|---|---|
| Relevance labels | Binary (relevant/not) | Graded (0, 1, 2, 3…) |
| Emphasis | Finding all relevant items | Highly relevant items first |
| Common use | IR benchmarks, academic | Search engines, recommendations |
Use MAP when relevance is binary (relevant or not). Use NDCG when you have graded relevance scores (e.g., 5-star ratings or editorial relevance judgements).
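Both metrics are available in scikit-learn; a quick comparison on one query with graded labels (the grades and scores below are illustrative), binarising the grades for AP:

```python
import numpy as np
from sklearn.metrics import average_precision_score, ndcg_score

graded = np.array([3, 0, 2, 1, 0])            # graded relevance, 0-3
scores = np.array([0.9, 0.8, 0.7, 0.5, 0.2])  # model ranking scores

# MAP-style AP needs binary labels: treat any grade > 0 as relevant.
ap = average_precision_score((graded > 0).astype(int), scores)
# NDCG consumes the graded labels directly (2D input: one row per query).
ndcg = ndcg_score(graded.reshape(1, -1), scores.reshape(1, -1))
print(ap, ndcg)
```

Note how the binarisation for AP discards the distinction between grades 1 and 3, which is exactly the information NDCG is designed to use.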
How do I handle queries with no relevant items? #
If a query has no relevant items, AP is undefined (division by zero). The standard approach is to exclude such queries from the MAP calculation. Alternatively, assign AP = 0, but document that choice, since it artificially lowers the score. Always report the total number of queries and the number excluded.