Summary
- Map the key metrics used in binary, multi-class, and multi-label classification.
- Compare confusion-matrix metrics, threshold-based/ranking metrics, and probability-calibration metrics.
- Summarise how to assemble a metric set that aligns with business goals and how to report the results.
Chapter 3 #
Classification metrics overview #
How a classifier should be evaluated depends on several viewpoints: class balance, the thresholding strategy, whether the probabilities need to be calibrated, and the quality of the top-ranked suggestions. This chapter organises representative metrics into those buckets and offers guidance on when to prioritise each.
Metric categories #
1. Confusion-matrix based #
- Accuracy: overall hit rate; misleading when the classes are imbalanced.
- Precision / Recall / F1: choose according to the relative cost of false positives and false negatives.
- Sensitivity / Specificity: essential in domains like medical screening.
- Macro / Micro / Weighted averaging: aggregate per-class metrics in multi-class settings.
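For a concrete sense of how these metrics and the averaging strategies are computed, here is a minimal scikit-learn sketch on a small hand-made multi-class example (the labels are invented purely for illustration):

```python
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

# Toy three-class labels, deliberately imbalanced.
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 2, 2, 2, 0, 2]

print(confusion_matrix(y_true, y_pred))

# Macro weights every class equally, micro pools all individual decisions,
# and weighted scales each class by its support.
for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

For single-label multi-class problems the micro average coincides with plain accuracy, which is why macro or weighted averages are usually more informative under class imbalance.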
2. Threshold and ranking #
- Precision-Recall curve / PR-AUC: highlights performance when the positive class is rare.
- ROC curve / ROC-AUC: measures separability across all thresholds.
- Top-k Accuracy / Hit Rate: relevant for recommendation/search scenarios where only the top suggestions are shown.
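The following is a rough sketch of how these ranking-oriented scores can be obtained with scikit-learn; the synthetic three-class dataset and the logistic-regression model are assumptions made only so that top-k accuracy is meaningful:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    average_precision_score,
    roc_auc_score,
    top_k_accuracy_score,
)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Three-class toy problem so that top-k accuracy is non-trivial.
X, y = make_classification(
    n_samples=1500, n_classes=3, n_informative=6, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)

# One-vs-rest ROC-AUC and macro-averaged PR-AUC over the three classes.
print("ROC-AUC (OvR) :", roc_auc_score(y_te, prob, multi_class="ovr"))
print(
    "PR-AUC (macro):",
    average_precision_score(label_binarize(y_te, classes=[0, 1, 2]), prob),
)
# Fraction of samples whose true class is among the two highest-scoring classes.
print("Top-2 accuracy:", top_k_accuracy_score(y_te, prob, k=2))
```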
3. Probability calibration #
- Log Loss: rewards well-calibrated probabilities.
- Brier Score: pairs nicely with reliability curves to assess calibration quality.
- Calibration curves: compare predicted probabilities with observed frequencies.
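As a hedged sketch of how calibration can be inspected in practice, the snippet below computes Log Loss, the Brier Score, and a reliability (calibration) curve with scikit-learn on an assumed synthetic binary dataset:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss, log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

print("Log Loss   :", log_loss(y_te, prob))
print("Brier Score:", brier_score_loss(y_te, prob))

# Reliability curve: within each probability bin, compare the mean predicted
# probability with the observed positive rate; a well-calibrated model
# stays close to the diagonal.
frac_pos, mean_pred = calibration_curve(y_te, prob, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```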
4. Class-imbalance helpers #
- Balanced Accuracy: averages per-class recall.
- Cohen’s κ / Matthews correlation coefficient (MCC): robust alternatives when label imbalance is severe.
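To make the contrast with plain Accuracy concrete, here is a small hand-made example (the labels are invented for illustration) in which a near-majority-class predictor scores high on Accuracy but much lower on the imbalance-aware metrics:

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    cohen_kappa_score,
    matthews_corrcoef,
)

# 90 negatives, 10 positives; the model finds only 2 of the positives.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 2 + [0] * 8

print("Accuracy         :", accuracy_score(y_true, y_pred))  # looks high
print("Balanced Accuracy:", balanced_accuracy_score(y_true, y_pred))
print("Cohen's kappa    :", cohen_kappa_score(y_true, y_pred))
print("MCC              :", matthews_corrcoef(y_true, y_pred))
```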
How thresholds affect the scores #
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Imbalanced binary dataset: roughly 15% positives.
X, y = make_classification(
    n_samples=2000,
    n_features=12,
    n_informative=4,
    weights=[0.85, 0.15],
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Standardise features, fit a logistic regression, and keep the
# predicted probability of the positive class.
scaler = StandardScaler()
model = LogisticRegression(max_iter=2000)
model.fit(scaler.fit_transform(X_train), y_train)
prob = model.predict_proba(scaler.transform(X_test))[:, 1]

# Sweep the decision threshold and record Precision / Recall / F1.
thresholds = np.linspace(0.05, 0.95, 19)
precision, recall, f1 = [], [], []
for t in thresholds:
    y_pred = (prob >= t).astype(int)
    precision.append(precision_score(y_test, y_pred, zero_division=0))
    recall.append(recall_score(y_test, y_pred, zero_division=0))
    f1.append(f1_score(y_test, y_pred, zero_division=0))

fig, ax = plt.subplots(figsize=(6.8, 4))
ax.plot(thresholds, precision, label="Precision", color="#2563eb")
ax.plot(thresholds, recall, label="Recall", color="#dc2626")
ax.plot(thresholds, f1, label="F1", color="#0d9488")
ax.set_xlabel("Threshold")
ax.set_ylabel("Score")
ax.set_title("Effect of threshold on classification metrics")
ax.set_ylim(0, 1.05)
ax.grid(alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()
```

Lowering the threshold raises recall but hurts precision. F1 peaks near the balance point and is often used to pick an operating threshold.
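One simple way to operationalise that choice, sketched below under the assumption that `prob` and `y_test` from the code above are still in scope, is to take the threshold that maximises F1 along the precision-recall curve:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# precision_recall_curve returns one candidate threshold per operating point;
# the final (precision, recall) pair has no threshold attached, hence [:-1].
prec, rec, thr = precision_recall_curve(y_test, prob)
f1_scores = 2 * prec * rec / np.clip(prec + rec, 1e-12, None)
best = int(np.argmax(f1_scores[:-1]))
print(f"best threshold ~ {thr[best]:.2f}, F1 ~ {f1_scores[best]:.3f}")
```

A cost-weighted variant of the same sweep is shown in the checklist section below.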
Reporting and operations checklist #
- Always include the confusion matrix
  It reveals per-class error patterns and highlights critical classes.
- Justify the chosen threshold
  Use PR/ROC curves or cost analysis to explain the operating point (a cost-based sketch follows this list).
- Check probability calibration
  If scores drive pricing or resource allocation, inspect Brier Score and calibration plots.
- Monitor imbalance impact
  Compare Balanced Accuracy and MCC alongside Accuracy to avoid misleading improvements.
- Track drift after deployment
  Watch Precision/Recall, PR-AUC, and ROC-AUC over time and recalibrate thresholds when needed.
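As a hedged illustration of the cost analysis mentioned in the second item, the sketch below (the cost values are invented for demonstration, and `prob` / `y_test` are reused from the threshold-sweep example) picks the threshold that minimises expected cost when a false negative is assumed to be five times as expensive as a false positive:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative business costs: a missed positive (FN) is assumed to be
# five times as expensive as a false alarm (FP).
COST_FP, COST_FN = 1.0, 5.0

def expected_cost(threshold):
    y_pred = (prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    return COST_FP * fp + COST_FN * fn

thresholds = np.linspace(0.05, 0.95, 19)
costs = [expected_cost(t) for t in thresholds]
print(f"cost-minimising threshold: {thresholds[int(np.argmin(costs))]:.2f}")
```

In practice the cost ratio should come from the business side rather than be guessed; the point is simply that the chosen operating point can then be justified with a number.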
Quick reference #
| Perspective | Representative metrics | Related pages | Notes |
|---|---|---|---|
| Overall accuracy | Accuracy / Balanced Accuracy | Accuracy / Balanced Accuracy | Report both when classes are imbalanced |
| False positives vs. false negatives | Precision / Recall / Fβ | Precision-Recall / F1-score | Combine with threshold analysis |
| Ranking quality | PR-AUC / ROC-AUC / Top-k | PR curve / ROC-AUC / Top-k Accuracy | Tailored to imbalanced or recommendation tasks |
| Probability calibration | Log Loss / Brier Score | Log Loss / Brier Score | Needed when probabilities feed decisions |
| Robustness | MCC / Cohen’s κ | MCC / Cohen’s κ | Stable under class imbalance |
Final checklist #
- Combined metrics that reflect the class balance
- Shared the rationale behind the chosen threshold (PR/ROC or cost analysis)
- Verified probability calibration before using scores operationally
- Confirmed evaluation and production data follow comparable distributions
- Established consistent baseline metrics for future model updates