4.3.14
Averaging Strategies
Summary
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
Prerequisites
- Precision & Recall — understanding this concept first will make learning smoother
- F1 Score — understanding this concept first will make learning smoother
1. Main averaging options #
| average | How it is computed | When to use it |
|---|---|---|
| micro | Sum TP/FP/FN over all samples, then compute the metric | Emphasises overall correctness regardless of class distribution |
| macro | Compute the metric per class, then take the unweighted mean | Gives every class the same weight; highlights minority classes |
| weighted | Compute the metric per class, then take a support-weighted mean | Preserves class ratios; behaves closer to Accuracy |
| samples | Multi-label only: average the metric per sample | For cases where each sample can have multiple labels |
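The "How it is computed" column can be verified by hand. A minimal sketch with scikit-learn (the toy labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

# Imbalanced toy labels: class 0 dominates, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 2]

# micro: pool TP/FP/FN over all samples before computing the metric.
micro = f1_score(y_true, y_pred, average="micro")

# macro: compute F1 per class, then take the unweighted mean.
per_class = f1_score(y_true, y_pred, average=None)
macro = per_class.mean()
assert np.isclose(macro, f1_score(y_true, y_pred, average="macro"))

# weighted: support-weighted mean of the same per-class scores.
support = np.bincount(y_true)
weighted = np.average(per_class, weights=support)
assert np.isclose(weighted, f1_score(y_true, y_pred, average="weighted"))

print(micro, macro, weighted)
```

Note how macro sits below micro here: the rare class 2 scores poorly, and macro gives it the same weight as the dominant class.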
2. Comparing in Python 3.13 #
classification_report prints per-class metrics along with macro avg and weighted avg rows, making it easy to compare the strategies side by side. (A micro avg row only appears for multi-label input or a restricted label set; for plain multiclass input the report shows accuracy instead, which coincides with the micro-averaged score.)
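A minimal sketch of that side-by-side comparison (the labels are made up for illustration):

```python
from sklearn.metrics import classification_report, f1_score

# Toy multiclass labels.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# The printed report lists per-class rows plus macro avg / weighted avg.
print(classification_report(y_true, y_pred))

# output_dict=True exposes the same numbers programmatically.
report = classification_report(y_true, y_pred, output_dict=True)
macro_f1 = report["macro avg"]["f1-score"]
weighted_f1 = report["weighted avg"]["f1-score"]

# For plain multiclass input the report shows an "accuracy" row,
# which equals the micro-averaged F1.
assert abs(report["accuracy"] - f1_score(y_true, y_pred, average="micro")) < 1e-12
```

Pulling the values out of the dictionary is handy when you want to log several averages per experiment instead of eyeballing the printed table.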
3. Picking the right strategy #
- micro – Best when you care about overall correctness and every prediction carries the same importance.
- macro – Use when minority classes matter; it treats every class equally and penalises poor recall on rare labels.
- weighted – Useful when you want to stay close to the real class distribution while still reporting Precision/Recall/F1.
- samples – The natural choice for multi-label tasks where each sample can have several ground-truth labels.
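For the multi-label case, the samples strategy averages the metric row by row. A minimal sketch (the indicator matrices are made up for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

# Multi-label indicator matrices: one row per sample, one column per label.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# samples: compute F1 for each row, then average over the rows.
samples_f1 = f1_score(y_true, y_pred, average="samples")

# The same number by hand, one F1 per sample.
def row_f1(t, p):
    tp = np.sum(t & p)
    denom = np.sum(t) + np.sum(p)
    return 2 * tp / denom if denom else 1.0

manual = np.mean([row_f1(t, p) for t, p in zip(y_true, y_pred)])
assert np.isclose(samples_f1, manual)
```

Because each sample contributes equally, a sample with five labels and a sample with one label carry the same weight in the final score.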
Takeaways #
- The `average` parameter drastically changes the meaning of the resulting metric; match it to your task and business goal.
- Remember: `macro` treats classes equally, `micro` focuses on overall ratios, `weighted` preserves class balance, and `samples` targets multi-label use cases.
- scikit-learn lets you compute several averages in one go, so report multiple views to avoid misinterpreting model quality.