Precision-Recall

Eval

Precision, Recall, and F1 | Threshold tuning in Python 3.13

Created: Last updated: Read time: 3 min
まとめ
  • Precision, Recall, and F1 | Threshold tuning in Python 3.13の概要を押さえ、評価対象と読み取り方を整理します。
  • Python 3.13 のコード例で算出・可視化し、手順と実務での確認ポイントを確認します。
  • 図表や補助指標を組み合わせ、モデル比較や閾値調整に活かすヒントをまとめます。

1. Definitions at a glance #

Given the confusion-matrix counts — true positives (TP), false positives (FP), false negatives (FN) — precision and recall are defined as: \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}

  • Precision — of the items predicted as positive, which fraction is truly positive? Important when false alarms are costly.
  • Recall — of the actual positives, which fraction do we catch? Important when misses must be avoided.
  • F1 score — harmonic mean that balances precision and recall in a single number. F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}

2. Implementation and PR curve on Python 3.13 #

Install the dependencies in your interpreter:

python --version        # e.g. Python 3.13.0
pip install scikit-learn matplotlib

The snippet below creates an imbalanced dataset (positive class 5%), trains a weighted logistic regression, and plots the precision–recall (PR) curve along with the average precision (AP). The figure is stored at static/images/eval/classification/precision-recall/pr_curve.png so generate_eval_assets.py can refresh it automatically.

import matplotlib.pyplot as plt
import numpy as np
from pathlib import Path
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    precision_recall_curve,
    precision_score,
    recall_score,
    f1_score,
    average_precision_score,
)
from sklearn.model_selection import train_test_split
X, y = make_classification(
    n_samples=40_000,
    n_features=20,
    n_informative=6,
    weights=[0.95, 0.05],
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
clf = LogisticRegression(max_iter=2000, class_weight="balanced")
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]
y_pred = (proba >= 0.5).astype(int)
print(classification_report(y_test, y_pred, digits=3))
precision, recall, thresholds = precision_recall_curve(y_test, proba)
ap = average_precision_score(y_test, proba)
fig, ax = plt.subplots(figsize=(5, 4))
ax.step(recall, precision, where="post", label=f"PR curve (AP={ap:.3f})")
ax.set_xlabel("Recall")
ax.set_ylabel("Precision")
ax.set_xlim(0, 1)
ax.set_ylim(0, 1.05)
ax.grid(alpha=0.3)
ax.legend()
fig.tight_layout()
output_dir = Path("static/images/eval/classification/precision-recall")
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "pr_curve.png", dpi=150)
plt.close(fig)
Precision–Recall curve

Average precision (AP) is the area under the PR curve. Moving the threshold slides along the curve.


3. Choosing a decision threshold #

  • Keeping the default threshold (0.5) might yield low recall when the positive class is rare.
  • Every point on the PR curve corresponds to a threshold from precision_recall_curve’s hresholds array.
  • Lowering the threshold increases recall at the expense of precision; use business cost to decide how far you can move.
threshold = 0.3
custom_pred = (proba >= threshold).astype(int)
print(
    "threshold=0.30",
    "Precision=", precision_score(y_test, custom_pred),
    "Recall=", recall_score(y_test, custom_pred),
    "F1=", f1_score(y_test, custom_pred),
)

4. Averaging strategies for multiclass tasks #

scikit-learn’s verage parameter lets you aggregate class-wise metrics:

  • macro — simple mean across classes; treats each class equally.
  • weighted — weighted mean by class support; preserves overall balance.
  • micro — recompute from the pooled confusion matrix; useful but can hide minority-class behaviour.
precision_score(y_test, y_pred, average="macro")
recall_score(y_test, y_pred, average="weighted")
f1_score(y_test, y_pred, average="micro")

Summary #

  • Precision penalises false positives, recall penalises false negatives; F1 balances both.
  • Precision–recall curves reveal how the trade-off changes with the threshold; the average precision condenses it into a single number.
  • In Python 3.13 you can compute and plot the curve with a few lines of scikit-learn, then share the threshold analysis for collaborative decision-making.