MAP (Mean Average Precision)

Overview
  • MAP is a ranking metric that averages the Average Precision (AP) across multiple queries.
  • Using a search example, we compute AP and MAP to observe how ranking positions affect the score.
  • We also cover practical points such as queries with no relevant items and how to weight AP across queries.

1. Definition of AP and MAP #

For a single query, the Average Precision (AP) averages the precision at each rank where a correct item is found.

$$ \mathrm{AP} = \frac{1}{|G|} \sum_{k \in G} P(k) $$

Here, \(G\) is the set of ranks of relevant items, and \(P(k)\) is the precision up to position \(k\).
MAP is then the mean of AP values across all queries.

$$ \mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^Q \mathrm{AP}_q $$
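For example, suppose a query returns five items and the relevant ones sit at ranks 1 and 3, so \(G = \{1, 3\}\):

$$ \mathrm{AP} = \frac{1}{2}\left(\frac{1}{1} + \frac{2}{3}\right) \approx 0.83 $$

The same calculation as a minimal from-scratch sketch (the function name average_precision and the input list are only for illustration):

def average_precision(relevance):
    # relevance: ranked list of binary labels (1 = relevant), best-ranked item first
    hits = 0
    precisions = []
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)  # P(k) at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant items at ranks 1 and 3: (1/1 + 2/3) / 2 ≈ 0.8333
print(average_precision([1, 0, 1, 0, 0]))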


2. Computing in Python #

Although Scikit-learn doesn’t provide a direct MAP function, we can compute average_precision_score per query and then take the mean.

from sklearn.metrics import average_precision_score
import numpy as np

# Illustrative data: per-query binary relevance labels and model scores
y_true = {"q1": [1, 0, 1, 0], "q2": [0, 1, 1, 0]}
y_score = {"q1": [0.9, 0.8, 0.4, 0.1], "q2": [0.7, 0.6, 0.3, 0.2]}
queries = list(y_true)

aps = []
for q in queries:
    # AP for a single query's ranking
    aps.append(average_precision_score(y_true[q], y_score[q]))

map_score = np.mean(aps)  # MAP = mean of the per-query AP values
print("MAP:", round(map_score, 4))

Here, y_true[q] holds the binary relevance labels (0/1) for query q, and y_score[q] holds the model's output scores; in practice these come from your own evaluation data rather than the illustrative values above.
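As a quick check with made-up data for a single query, average_precision_score reproduces the hand calculation from Section 1 when the relevant items are ranked 1st and 3rd:

from sklearn.metrics import average_precision_score

# One query: the scores induce the ranking 1 > 2 > 3 > 4 > 5, relevant items at ranks 1 and 3
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.7, 0.5, 0.3, 0.1]
print(average_precision_score(labels, scores))  # (1/1 + 2/3) / 2 ≈ 0.8333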


3. Characteristics and Advantages #

  • Works well for rankings with multiple relevant items.
  • Rewards systems that find correct items earlier in the list (see the quick check after this list).
  • Reflects both precision and recall, offering a more comprehensive evaluation than simple Precision@k.
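To illustrate the second point with made-up scores: keeping the ranking fixed and moving the only relevant item from rank 1 to rank 3 drops AP from 1.0 to about 0.33.

from sklearn.metrics import average_precision_score

scores = [0.9, 0.5, 0.1]  # fixed ranking: item 1 > item 2 > item 3
print(average_precision_score([1, 0, 0], scores))  # relevant item ranked 1st -> AP = 1.0
print(average_precision_score([0, 0, 1], scores))  # relevant item ranked 3rd -> AP ≈ 0.33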

4. Practical Applications #

  • Search systems: Evaluate how well the results cover all relevant items.
  • Recommendation systems: Ideal when multiple relevant outputs exist (e.g., products viewed or purchased).
  • Learning to Rank (LTR): Commonly used for offline evaluation of ranking models such as LambdaMART (e.g., the gradient-boosted rankers in XGBoost).

5. Points to Note #

  • Plain MAP gives every query equal weight no matter how many relevant items it has, so results can be skewed when that number varies greatly across queries; a weighted average (Weighted MAP) is one remedy (see the sketch after this list).
  • Queries with no relevant items (all labels 0) have an undefined AP; decide in advance whether to exclude them or score them as zero.
  • Use together with NDCG, Recall@k, and other metrics for a holistic ranking evaluation.
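A minimal sketch of both points, assuming one common weighting choice (each query's AP is weighted by its number of relevant items) and excluding queries with no relevant items; the function weighted_map and the numbers are purely illustrative:

import numpy as np

def weighted_map(aps, n_relevant):
    # aps: per-query AP values; n_relevant: number of relevant items per query
    aps = np.asarray(aps, dtype=float)
    weights = np.asarray(n_relevant, dtype=float)
    keep = weights > 0  # drop queries with no relevant items instead of scoring them 0
    return float(np.average(aps[keep], weights=weights[keep]))

# Query 2 (3 relevant items) counts three times as much as query 1 (1 relevant item);
# query 3 has no relevant items and is excluded.
print(weighted_map([0.8, 0.5, 0.0], [1, 3, 0]))  # 0.575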

Summary #

  • MAP is the average of average precisions — ideal for rankings with multiple correct answers.
  • It’s simple to compute by averaging AP values per query.
  • Combine it with NDCG, Recall@k, and similar metrics to evaluate ranking systems from multiple perspectives.