Summary
- Map the key metrics used in binary, multi-class, and multi-label classification.
- Compare confusion-matrix metrics, threshold-based/ranking metrics, and probability-calibration metrics.
- Summarise how to assemble a metric set that aligns with business goals and how to report the results.
Chapter 3 #
Classification metrics overview #
How a classifier should be evaluated depends on several viewpoints: class balance, the thresholding strategy, whether the probabilities need to be calibrated, and the quality of the top-ranked suggestions. This chapter organises representative metrics into those buckets and offers guidance on when to prioritise each.
Metric categories #
1. Confusion-matrix based #
- Accuracy: overall hit rate; misleading when the classes are imbalanced.
- Precision / Recall / F1: choose according to the relative cost of false positives and false negatives.
- Sensitivity / Specificity: essential in domains like medical screening.
- Macro / Micro / Weighted averaging: aggregate per-class metrics in multi-class settings.
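For a concrete sense of how these metrics and the averaging strategies are computed, here is a minimal scikit-learn sketch on a small hand-made multi-class example (the labels are invented purely for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy three-class labels, deliberately imbalanced.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(confusion_matrix(y_true, y_pred))

# Macro weights every class equally, micro pools all individual decisions,
# and weighted scales each class by its support.
for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

For single-label multi-class problems the micro average coincides with plain accuracy, which is why macro or weighted averages are usually more informative under class imbalance.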
2. Threshold and ranking #
- Precision-Recall curve / PR-AUC: highlights performance when the positive class is rare.
- ROC curve / ROC-AUC: measures separability across all thresholds.
- Top-k Accuracy / Hit Rate: relevant for recommendation/search scenarios where only the top suggestions are shown.
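The following is a rough sketch of how these ranking-oriented scores can be obtained with scikit-learn; the synthetic three-class dataset and the logistic-regression model are assumptions made only so that top-k accuracy is meaningful:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    roc_auc_score,
    top_k_accuracy_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Three-class toy problem so that top-k accuracy is non-trivial.
X, y = make_classification(
    n_samples=1500, n_classes=3, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest ROC-AUC and macro-averaged PR-AUC over the three classes.
print("ROC-AUC (OvR) :", roc_auc_score(y_te, prob, multi_class="ovr"))
print(
    "PR-AUC (macro):",
    average_precision_score(label_binarize(y_te, classes=[0, 1, 2]), prob),
)
# Fraction of samples whose true class is among the two highest-scoring classes.
print("Top-2 accuracy:", top_k_accuracy_score(y_te, prob, k=2))
```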
3. Probability calibration #
- Log Loss: rewards well-calibrated probabilities.
- Brier Score: pairs nicely with reliability curves to assess calibration quality.
- Calibration curves: compare predicted probabilities with observed frequencies.
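As a hedged sketch of how calibration can be inspected in practice, the snippet below computes Log Loss, the Brier Score, and a reliability (calibration) curve with scikit-learn on an assumed synthetic binary dataset:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("Log Loss   :", log_loss(y_te, prob))
print("Brier Score:", brier_score_loss(y_te, prob))

# Reliability curve: within each probability bin, compare the mean predicted
# probability with the observed positive rate; a well-calibrated model
# stays close to the diagonal.
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```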
4. Class-imbalance helpers #
- Balanced Accuracy: averages per-class recall.
- Cohen’s κ / Matthews correlation coefficient (MCC): robust alternatives when label imbalance is severe.
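To make the contrast with plain Accuracy concrete, here is a small hand-made example (the labels are invented for illustration) in which a near-majority-class predictor scores high on Accuracy but much lower on the imbalance-aware metrics:

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    cohen_kappa_score,
    matthews_corrcoef,
)

# 90 negatives, 10 positives; the model finds only 2 of the positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 2 + [0] * 8

print("Accuracy         :", accuracy_score(y_true, y_pred))  # looks high
print("Balanced Accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Cohen's kappa    :", cohen_kappa_score(y_true, y_pred))
print("MCC              :", matthews_corrcoef(y_true, y_pred))
```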
How thresholds affect the scores #
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Imbalanced binary dataset: roughly 15% positives.
X, y = make_classification(
    n_samples=2000,
    n_features=12,
    n_informative=4,
    weights=[0.85, 0.15],
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Standardise features, fit a logistic regression, and keep the
# predicted probability of the positive class.
scaler = StandardScaler()
model = LogisticRegression(max_iter=2000)
model.fit(scaler.fit_transform(X_train), y_train)
prob = model.predict_proba(scaler.transform(X_test))[:, 1]

# Sweep the decision threshold and record Precision / Recall / F1.
thresholds = np.linspace(0.05, 0.95, 19)
precision, recall, f1 = [], [], []
for t in thresholds:
    y_pred = (prob >= t).astype(int)
    precision.append(precision_score(y_test, y_pred, zero_division=0))
    recall.append(recall_score(y_test, y_pred, zero_division=0))
    f1.append(f1_score(y_test, y_pred, zero_division=0))

fig, ax = plt.subplots(figsize=(6.8, 4))
ax.plot(thresholds, precision, label="Precision", color="#2563eb")
ax.plot(thresholds, recall, label="Recall", color="#dc2626")
ax.plot(thresholds, f1, label="F1", color="#0d9488")
ax.set_xlabel("Threshold")
ax.set_ylabel("Score")
ax.set_title("Effect of threshold on classification metrics")
ax.set_ylim(0, 1.05)
ax.grid(alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()
```

Lowering the threshold raises recall but hurts precision. F1 peaks near the balance point and is often used to pick an operating threshold.
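One simple way to operationalise that choice, sketched below under the assumption that `prob` and `y_test` from the code above are still in scope, is to take the threshold that maximises F1 along the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# precision_recall_curve returns one candidate threshold per operating point;
# the final (precision, recall) pair has no threshold attached, hence [:-1].
prec, rec, thr = precision_recall_curve(y_test, prob)
f1_scores = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = int(np.argmax(f1_scores[:-1]))
print(f"best threshold ~ {thr[best]:.2f}, F1 ~ {f1_scores[best]:.3f}")
```

A cost-weighted variant of the same sweep is shown in the checklist section below.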
Reporting and operations checklist #
- Always include the confusion matrix
  It reveals per-class error patterns and highlights critical classes.
- Justify the chosen threshold
  Use PR/ROC curves or cost analysis to explain the operating point (a cost-based sketch follows this list).
- Check probability calibration
  If scores drive pricing or resource allocation, inspect Brier Score and calibration plots.
- Monitor imbalance impact
  Compare Balanced Accuracy and MCC alongside Accuracy to avoid misleading improvements.
- Track drift after deployment
  Watch Precision/Recall, PR-AUC, and ROC-AUC over time and recalibrate thresholds when needed.
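As a hedged illustration of the cost analysis mentioned in the second item, the sketch below (the cost values are invented for demonstration, and `prob` / `y_test` are reused from the threshold-sweep example) picks the threshold that minimises expected cost when a false negative is assumed to be five times as expensive as a false positive:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative business costs: a missed positive (FN) is assumed to be
# five times as expensive as a false alarm (FP).
COST_FP, COST_FN = 1.0, 5.0

def expected_cost(threshold):
    y_pred = (prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(t) for t in thresholds]
print(f"cost-minimising threshold: {thresholds[int(np.argmin(costs))]:.2f}")
```

In practice the cost ratio should come from the business side rather than be guessed; the point is simply that the chosen operating point can then be justified with a number.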
Quick reference #
| Perspective | Representative metrics | Related pages | Notes |
|---|---|---|---|
| Overall accuracy | Accuracy / Balanced Accuracy | Accuracy / Balanced Accuracy | Report both when classes are imbalanced |
| False positives vs. false negatives | Precision / Recall / Fβ | Precision-Recall / F1-score | Combine with threshold analysis |
| Ranking quality | PR-AUC / ROC-AUC / Top-k | PR curve / ROC-AUC / Top-k Accuracy | Tailored to imbalanced or recommendation tasks |
| Probability calibration | Log Loss / Brier Score | Log Loss / Brier Score | Needed when probabilities feed decisions |
| Robustness | MCC / Cohen’s κ | MCC / Cohen’s κ | Stable under class imbalance |
Final checklist #
- Combined metrics that reflect the class balance
- Shared the rationale behind the chosen threshold (PR/ROC or cost analysis)
- Verified probability calibration before using scores operationally
- Confirmed evaluation and production data follow comparable distributions
- Established consistent baseline metrics for future model updates