4.3 Classification
Summary
- Map the key metrics used in binary, multi-class, and multi-label classification.
- Compare confusion-matrix metrics, threshold-based/ranking metrics, and probability-calibration metrics.
- Summarise how to assemble a metric set that aligns with business goals and how to report the results.
Classification metrics overview #
Evaluating a classifier involves balancing several viewpoints: class balance, thresholding strategy, calibrated probabilities, and the quality of the top-ranked suggestions. This chapter organises representative metrics along those axes and offers guidance on when to prioritise each.
Metric categories #
1. Confusion-matrix based #
- Accuracy: overall hit rate; misleading when the classes are imbalanced.
- Precision / Recall / F1-score: choose according to the relative cost of false positives and false negatives.
- Sensitivity / Specificity: essential in domains like medical screening.
- Macro / Micro / Weighted averaging: aggregate per-class metrics in multi-class settings.
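To make the confusion-matrix metrics concrete, here is a minimal pure-Python sketch (function names are illustrative, not from any particular library; scikit-learn offers equivalents such as `precision_recall_fscore_support`):

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP/FP/FN/TN for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
p, r, f = precision_recall_f1(y_true, y_pred)
print(p, r, f)  # 0.75 0.75 0.75
```

Macro averaging would call `precision_recall_f1` once per class and average the results; micro averaging would pool the TP/FP/FN counts across classes first.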
2. Threshold and ranking #
- Precision-Recall curve / PR-AUC: highlights performance when the positive class is rare.
- ROC curve / ROC-AUC: measures separability across all thresholds.
- Top-k Accuracy / Hit Rate: relevant for recommendation/search scenarios where only the top suggestions are shown.
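Two of these ranking metrics can be sketched directly from their definitions. ROC-AUC equals the probability that a random positive is scored above a random negative (ties counting half), and Top-k Accuracy checks whether the true class appears among the k highest-scored classes. A minimal sketch, with illustrative function names:

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def top_k_accuracy(y_true, score_lists, k=2):
    """Fraction of samples whose true class index is among the k top-scored classes."""
    hits = 0
    for true_class, scores in zip(y_true, score_lists):
        ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
        hits += true_class in ranked[:k]
    return hits / len(y_true)

print(roc_auc([1, 1, 0, 0], [0.9, 0.8, 0.7, 0.1]))  # 1.0 (perfect separation)
```

scikit-learn's `roc_auc_score` and `top_k_accuracy_score` provide production-grade versions of both.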
3. Probability calibration #
- Log Loss: rewards well-calibrated probabilities.
- Brier Score: pairs nicely with reliability curves to assess calibration quality.
- Calibration curves: compare predicted probabilities with observed frequencies.
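Both scalar calibration metrics follow directly from their formulas: Log Loss is the mean negative log-likelihood of the true labels, and the Brier Score is the mean squared error between predicted probabilities and outcomes. A minimal sketch (the clipping constant `eps` is an implementation convenience to avoid `log(0)`):

```python
import math

def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for t, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)

# A maximally uncertain model always predicting 0.5:
print(brier_score([1, 0], [0.5, 0.5]))  # 0.25
```

Both penalise confident mistakes heavily, but Log Loss does so without bound, which is why the Brier Score is often preferred for monitoring dashboards.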
4. Class-imbalance helpers #
- Balanced Accuracy: averages per-class recall.
- Cohen’s κ / MCC (Matthews Correlation Coefficient): robust alternatives when label imbalance is severe.
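The following sketch shows why these helpers matter: on a heavily imbalanced set, a classifier that predicts only the majority class gets 90% accuracy, yet Balanced Accuracy and MCC expose it. Function names are illustrative (scikit-learn's `balanced_accuracy_score` and `matthews_corrcoef` are the standard implementations):

```python
import math

def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall."""
    recalls = []
    for c in set(y_true):
        members = [(t, p) for t, p in zip(y_true, y_pred) if t == c]
        recalls.append(sum(1 for t, p in members if p == c) / len(members))
    return sum(recalls) / len(recalls)

def mcc(y_true, y_pred):
    """Matthews Correlation Coefficient for binary labels (0 if undefined)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

y_true = [1] + [0] * 9        # 10% positives
y_pred = [0] * 10             # degenerate majority-class predictor
print(balanced_accuracy(y_true, y_pred))  # 0.5 (vs. 0.9 plain accuracy)
print(mcc(y_true, y_pred))                # 0.0
```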
How thresholds affect the scores #
Lowering the threshold raises recall but hurts precision. F1 peaks near the balance point and is often used to pick an operating threshold.
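This trade-off can be seen numerically by sweeping the threshold over a toy set of scores (a minimal sketch; the helper name is illustrative, and scikit-learn's `precision_recall_curve` does the full sweep in one call):

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, preds))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.6, 0.3, 0.2]
for th in (0.8, 0.5, 0.25):
    p, r = precision_recall_at(y_true, scores, th)
    print(f"threshold={th}: precision={p:.2f} recall={r:.2f}")
# threshold=0.8: precision=1.00 recall=0.33
# threshold=0.5: precision=0.67 recall=0.67
# threshold=0.25: precision=0.60 recall=1.00
```

As the threshold drops, recall climbs monotonically while precision falls, which is exactly the trade-off the operating point must settle.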
Reporting and operations checklist #
- Always include the confusion matrix: it reveals per-class error patterns and highlights critical classes.
- Justify the chosen threshold: use PR/ROC curves or cost analysis to explain the operating point.
- Check probability calibration: if scores drive pricing or resource allocation, inspect Brier Score and calibration plots.
- Monitor imbalance impact: compare Balanced Accuracy and MCC alongside Accuracy to avoid misleading improvements.
- Track drift after deployment: watch Precision/Recall, PR-AUC, and ROC-AUC over time and recalibrate thresholds when needed.
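The drift-tracking step can be as simple as comparing each recent metric value against its frozen baseline. A minimal sketch, assuming a flat dict of metric names to values (the function name, metric names, and the 0.05 tolerance are all illustrative choices):

```python
def metric_drift(baseline, recent, tolerance=0.05):
    """Return metrics whose recent value dropped more than `tolerance` below baseline."""
    return {name: (baseline[name], recent.get(name, 0.0))
            for name in baseline
            if baseline[name] - recent.get(name, 0.0) > tolerance}

baseline = {"precision": 0.82, "recall": 0.74, "pr_auc": 0.80}
recent = {"precision": 0.81, "recall": 0.61, "pr_auc": 0.79}
print(metric_drift(baseline, recent))  # {'recall': (0.74, 0.61)}
```

In practice the alert would feed a recalibration or retraining job rather than a print statement, but the comparison logic is the same.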
Quick reference #
| Perspective | Representative metrics | Related pages | Notes |
|---|---|---|---|
| Overall accuracy | Accuracy / Balanced Accuracy | Accuracy / Balanced Accuracy | Report both when classes are imbalanced |
| False positives vs. false negatives | Precision / Recall / Fβ | Precision-Recall / F1-score | Combine with threshold analysis |
| Ranking quality | PR-AUC / ROC-AUC / Top-k | PR curve / ROC-AUC / Top-k Accuracy | Tailored to imbalanced or recommendation tasks |
| Probability calibration | Log Loss / Brier Score | Log Loss / Brier Score | Needed when probabilities feed decisions |
| Robustness | MCC / Cohen’s κ | MCC / Cohen’s κ | Stable under class imbalance |
Final checklist #
- Combined metrics that reflect the class balance
- Shared the rationale behind the chosen threshold (PR/ROC or cost analysis)
- Verified probability calibration before using scores operationally
- Confirmed evaluation and production data follow comparable distributions
- Established consistent baseline metrics for future model updates