Classification

Summary
  • Map the key metrics used in binary, multi-class, and multi-label classification.
  • Compare confusion-matrix metrics, threshold-based/ranking metrics, and probability-calibration metrics.
  • Summarise how to assemble a metric set that aligns with business goals and how to report the results.

Chapter 3 #

Classification metrics overview #

Evaluating a classifier involves several viewpoints: class balance, the thresholding strategy, probability calibration, and the quality of the top-ranked suggestions. This chapter organises representative metrics into those buckets and offers guidance on when to prioritise each.


Metric categories #

1. Confusion-matrix based #

  • Accuracy: overall hit rate; misleading when the classes are imbalanced.
  • Precision / Recall / F1: choose according to the relative cost of false positives and false negatives.
  • Specificity / Sensitivity: essential in domains like medical screening.
  • Macro / Micro / Weighted averaging: strategies for aggregating per-class metrics in multi-class settings.
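
A minimal sketch of these confusion-matrix metrics with scikit-learn; the toy label arrays are made up purely for illustration:

import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    accuracy_score,
    precision_recall_fscore_support,
)

# Hypothetical multi-class labels; replace with real model output.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1, 0, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0, 2, 2, 0, 2])

print(confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))

# Macro treats every class equally, micro pools all decisions,
# weighted scales each class by its support.
for avg in ("macro", "micro", "weighted"):
    p, r, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average=avg, zero_division=0
    )
    print(f"{avg:>8}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")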

2. Threshold and ranking #

  • Precision-Recall curve / PR-AUC: highlights performance when the positive class is rare.
  • ROC curve / ROC-AUC: measures separability across all thresholds.
  • Top-k Accuracy / Hit Rate: relevant for recommendation/search scenarios where only the top suggestions are shown.
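
A short sketch of these ranking metrics with scikit-learn, again on made-up arrays; note that top_k_accuracy_score expects one score per class, so it is shown on a small multi-class example:

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, top_k_accuracy_score

# Hypothetical binary ground truth and predicted positive-class scores.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.05, 0.3, 0.6, 0.45])

print("PR-AUC (average precision):", average_precision_score(y_true, y_score))
print("ROC-AUC:", roc_auc_score(y_true, y_score))

# Top-k accuracy on a toy 3-class problem: a prediction counts as a hit
# if the true class appears among the k highest-scoring classes.
y_multi = np.array([0, 1, 2, 2])
scores = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.4, 0.4, 0.2],
])
print("Top-2 accuracy:", top_k_accuracy_score(y_multi, scores, k=2))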

3. Probability calibration #

  • Log Loss: rewards well-calibrated probabilities.
  • Brier Score: pairs nicely with reliability curves to assess calibration quality.
  • Calibration curves: compare predicted probabilities with observed frequencies.
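
A sketch of the calibration metrics with scikit-learn, using hypothetical labels and predicted probabilities:

import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import log_loss, brier_score_loss

# Hypothetical binary labels and predicted positive-class probabilities.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.8, 0.15, 0.55])

print("Log Loss:", log_loss(y_true, y_prob))
print("Brier Score:", brier_score_loss(y_true, y_prob))

# Calibration curve: observed positive rate per bin vs. mean predicted probability.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")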

4. Class-imbalance helpers #

  • Balanced Accuracy: averages per-class recall so that minority classes carry equal weight.
  • MCC (Matthews correlation coefficient): stays informative even under severe imbalance.
  • Cohen's κ: measures agreement beyond what chance alone would produce.
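
The contrast these metrics provide is easy to reproduce. In the sketch below (made-up labels), a degenerate classifier that always predicts the majority class still reaches 90% accuracy, while the imbalance-aware metrics fall to chance level:

import numpy as np
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    matthews_corrcoef,
    cohen_kappa_score,
)

# Hypothetical imbalanced labels: the classifier ignores the minority class.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 100)

print("Accuracy:", accuracy_score(y_true, y_pred))                    # 0.90 looks great
print("Balanced Accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.50 reveals the problem
print("MCC:", matthews_corrcoef(y_true, y_pred))                      # 0.0
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))            # 0.0
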
How thresholds affect the scores #

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Imbalanced binary dataset: roughly 85% negatives, 15% positives
X, y = make_classification(
    n_samples=2000,
    n_features=12,
    n_informative=4,
    weights=[0.85, 0.15],
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Standardise the features, then fit a logistic regression
scaler = StandardScaler()
model = LogisticRegression(max_iter=2000)
model.fit(scaler.fit_transform(X_train), y_train)

# Positive-class probabilities on the test set and a grid of candidate thresholds
prob = model.predict_proba(scaler.transform(X_test))[:, 1]
thresholds = np.linspace(0.05, 0.95, 19)

# Recompute precision, recall, and F1 at every threshold
precision, recall, f1 = [], [], []
for t in thresholds:
    y_pred = (prob >= t).astype(int)
    precision.append(precision_score(y_test, y_pred, zero_division=0))
    recall.append(recall_score(y_test, y_pred, zero_division=0))
    f1.append(f1_score(y_test, y_pred, zero_division=0))

# Plot all three scores against the threshold
fig, ax = plt.subplots(figsize=(6.8, 4))
ax.plot(thresholds, precision, label="Precision", color="#2563eb")
ax.plot(thresholds, recall, label="Recall", color="#dc2626")
ax.plot(thresholds, f1, label="F1", color="#0d9488")
ax.set_xlabel("Threshold")
ax.set_ylabel("Score")
ax.set_title("Effect of threshold on classification metrics")
ax.set_ylim(0, 1.05)
ax.grid(alpha=0.3)
ax.legend()
plt.tight_layout()
plt.show()

Figure: Precision, recall, and F1 across thresholds

Lowering the threshold raises recall but hurts precision. F1 peaks near the balance point and is often used to pick an operating threshold.
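
One common way to operationalise this, assuming the y_test and prob arrays from the snippet above, is to sweep the thresholds returned by precision_recall_curve and keep the one that maximises F1:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Assumes y_test (true labels) and prob (positive-class probabilities) from above.
prec, rec, thr = precision_recall_curve(y_test, prob)

# F1 for every candidate threshold; the final precision/recall pair has no threshold.
f1_scores = 2 * prec[:-1] * rec[:-1] / np.clip(prec[:-1] + rec[:-1], 1e-12, None)
best = np.argmax(f1_scores)
print(f"Best threshold: {thr[best]:.3f} (F1={f1_scores[best]:.3f})")

In practice the F1-optimal point should still be sanity-checked against the business cost of each error type, as discussed in the checklist below.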


Reporting and operations checklist #

  1. Always include the confusion matrix
    It reveals per-class error patterns and highlights critical classes.
  2. Justify the chosen threshold
    Use PR/ROC curves or cost analysis to explain the operating point.
  3. Check probability calibration
    If scores drive pricing or resource allocation, inspect Brier Score and calibration plots.
  4. Monitor imbalance impact
    Compare Balanced Accuracy and MCC alongside Accuracy to avoid misleading improvements.
  5. Track drift after deployment
    Watch Precision/Recall, PR-AUC, and ROC-AUC over time and recalibrate thresholds when needed.
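
As a sketch of step 5, assuming a prediction log with hypothetical columns timestamp, y_true, and score, the key metrics can be recomputed per week and compared against the offline baseline:

import pandas as pd
from sklearn.metrics import precision_score, recall_score, roc_auc_score

def weekly_metrics(log: pd.DataFrame, threshold: float = 0.5) -> pd.DataFrame:
    """Recompute precision/recall/ROC-AUC per calendar week from a prediction log."""
    log = log.assign(
        week=log["timestamp"].dt.to_period("W"),
        y_pred=(log["score"] >= threshold).astype(int),
    )
    rows = []
    for week, g in log.groupby("week"):
        rows.append({
            "week": str(week),
            "precision": precision_score(g["y_true"], g["y_pred"], zero_division=0),
            "recall": recall_score(g["y_true"], g["y_pred"], zero_division=0),
            # ROC-AUC is undefined when a window contains only one class.
            "roc_auc": roc_auc_score(g["y_true"], g["score"]) if g["y_true"].nunique() > 1 else float("nan"),
        })
    return pd.DataFrame(rows)

If a week's precision or ROC-AUC drifts well below the offline baseline, revisit the threshold or retrain.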

Quick reference #

| Perspective | Representative metrics | Related pages | Notes |
| --- | --- | --- | --- |
| Overall accuracy | Accuracy / Balanced Accuracy | Accuracy / Balanced Accuracy | Report both when classes are imbalanced |
| False positives vs. false negatives | Precision / Recall / Fβ | Precision-Recall / F1-score | Combine with threshold analysis |
| Ranking quality | PR-AUC / ROC-AUC / Top-k | PR curve / ROC-AUC / Top-k Accuracy | Tailored to imbalanced or recommendation tasks |
| Probability calibration | Log Loss / Brier Score | Log Loss / Brier Score | Needed when probabilities feed decisions |
| Robustness | MCC / Cohen's κ | MCC / Cohen's κ | Stable under class imbalance |

Final checklist #

  • Combined metrics that reflect the class balance
  • Shared the rationale behind the chosen threshold (PR/ROC or cost analysis)
  • Verified probability calibration before using scores operationally
  • Confirmed evaluation and production data follow comparable distributions
  • Established consistent baseline metrics for future model updates