Choosing Averaging Strategies for Classification Metrics

Summary
  • Get an overview of averaging strategies for classification metrics, and organize what each strategy evaluates and how to read it.
  • Compute and compare the strategies with Python 3.13 code examples, confirming the steps and practical checkpoints.
  • Combine the report with auxiliary metrics for hints on model comparison and threshold tuning.

1. Main averaging options #

| average | How it is computed | When to use it |
| --- | --- | --- |
| micro | Sum TP/FP/FN over all samples, then compute the metric | Emphasises overall correctness regardless of class distribution |
| macro | Compute the metric per class, then take the unweighted mean | Gives every class the same weight; highlights minority classes |
| weighted | Compute the metric per class, then take a support-weighted mean | Preserves class ratios; behaves closer to accuracy |
| samples | Multi-label only: average the metric per sample | For cases where each sample can have multiple labels |
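
To see how the first three differ, here is a minimal sketch on a hand-made 3-class prediction (the label vectors below are invented purely for illustration):

from sklearn.metrics import f1_score

# Toy 3-class data: class 0 is the majority, class 2 is rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]

for avg in ["micro", "macro", "weighted"]:
    print(f"F1 ({avg}): {f1_score(y_true, y_pred, average=avg):.3f}")
# Per-class F1 is [1.000, 0.800, 0.667]; macro (0.822) is pulled down
# by the rare class, while micro (0.900) and weighted (0.907) track
# the majority class.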

2. Comparing in Python 3.13 #

python --version  # e.g. Python 3.13.0
pip install scikit-learn

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(
    n_samples=30_000,
    n_features=20,
    n_informative=6,
    n_classes=3,                # the three weights below need three classes
    weights=[0.85, 0.1, 0.05],  # imbalanced classes
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = make_pipeline(
    StandardScaler(),
    # multi_class="ovr" is deprecated in recent scikit-learn releases;
    # the default multinomial handling works fine here.
    LogisticRegression(max_iter=2000),
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test, y_pred, digits=3))
for avg in ["micro", "macro", "weighted"]:
    print(f"F1 ({avg}):", f1_score(y_test, y_pred, average=avg))

classification_report prints the per-class metrics together with macro avg and weighted avg rows (plus accuracy, which for single-label multiclass problems equals the micro-averaged F1), making it easy to compare the strategies side by side.
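
If you want the same numbers programmatically rather than as printed text, classification_report can also return a dictionary; a small sketch reusing y_test and y_pred from above:

report = classification_report(y_test, y_pred, output_dict=True)
for key in ["macro avg", "weighted avg"]:
    print(key, {k: round(v, 3) for k, v in report[key].items()})
# accuracy is stored as a single float in the dict.
print("accuracy:", round(report["accuracy"], 3))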


3. Picking the right strategy #

  • micro – Best when you care about overall correctness and every prediction carries the same importance.
  • macro – Use when minority classes matter; it treats every class equally and penalises poor recall on rare labels.
  • weighted – Useful when you want to stay close to the real class distribution while still reporting Precision/Recall/F1.
  • samples – The natural choice for multi-label tasks where each sample can have several ground-truth labels; see the sketch after this list.
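
Because the main example above is single-label, here is a minimal multi-label sketch (the indicator matrices are invented for illustration) showing average="samples" in action:

import numpy as np
from sklearn.metrics import f1_score

# Rows are samples, columns are labels (multi-label indicator format).
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 0]])

# "samples" computes F1 per row, then averages over the rows.
print("F1 (samples):", f1_score(Y_true, Y_pred, average="samples"))
# Per-row F1 is [2/3, 1, 2/3], so the result is about 0.778.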

Takeaways #

  • The average parameter changes the meaning of the resulting metric drastically; match it with your task and business goal.
  • Remember: macro treats classes equally, micro reflects overall prediction counts, weighted follows the actual class distribution, and samples targets multi-label use cases.
  • scikit-learn lets you compute several averages in one go, so report multiple views to avoid misinterpreting model quality.
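
As one way to report several views at once, precision_recall_fscore_support returns precision, recall, and F1 together for each strategy (reusing y_test and y_pred from section 2):

from sklearn.metrics import precision_recall_fscore_support

for avg in ["micro", "macro", "weighted"]:
    p, r, f, _ = precision_recall_fscore_support(y_test, y_pred, average=avg)
    print(f"{avg:>8}: P={p:.3f} R={r:.3f} F1={f:.3f}")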