Brier Score

Intermediate

4.3.10

Last updated 2020-06-03 Read time 3 min
Summary
  • Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
  • Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
  • Combine charts and complementary metrics for effective model comparison and threshold tuning.

1. Definition #

For binary classification the Brier Score is \(\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2\), where \(p_i\) is the predicted probability of the positive class and \(y_i\) is the actual label (0 or 1). For multiclass tasks, sum the squared error over the \(K\) classes for each sample and average over samples: \(\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} (p_{ik} - y_{ik})^2\), where \(y_{ik}\) is the one-hot encoding of the true label.
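The definition can be checked by hand. A minimal sketch with made-up probabilities (the values are illustrative, not from the dataset used later):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([1, 0, 1, 1])
p_pred = np.array([0.9, 0.2, 0.6, 0.4])

# Mean squared difference between predicted probability and actual outcome:
# ((0.9-1)^2 + (0.2-0)^2 + (0.6-1)^2 + (0.4-1)^2) / 4 = 0.1425
manual = np.mean((p_pred - y_true) ** 2)

print(manual)                            # 0.1425
print(brier_score_loss(y_true, p_pred))  # same value
```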


2. Implementation and visualisation in Python 3.13 #

python --version        # e.g. Python 3.13.0
pip install scikit-learn matplotlib

The snippet below trains logistic regression on the breast-cancer dataset, prints the Brier Score, and plots a reliability diagram. The figure is saved to static/images/eval/classification/brier-score/reliability_curve.png, ready to be regenerated by generate_eval_assets.py.

from pathlib import Path

import matplotlib.pyplot as plt
from sklearn.calibration import CalibrationDisplay
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stratified split keeps the class ratio identical in train and test sets
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Standardise features, then fit logistic regression
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000, solver="lbfgs"),
)
pipeline.fit(X_train, y_train)

# The Brier Score uses the predicted probability of the positive class
proba = pipeline.predict_proba(X_test)[:, 1]
score = brier_score_loss(y_test, proba)
print(f"Brier Score: {score:.3f}")

# Reliability diagram: mean predicted probability vs. observed positive rate
fig, ax = plt.subplots(figsize=(5, 5))
CalibrationDisplay.from_predictions(y_test, proba, n_bins=10, ax=ax)
ax.set_title("Reliability Diagram (Breast Cancer Dataset)")
fig.tight_layout()

output_dir = Path("static/images/eval/classification/brier-score")
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "reliability_curve.png", dpi=150)
plt.close(fig)
Figure: reliability diagram produced by the snippet above.

Deviations from the 45° line reveal over- or under-confident probabilities.


3. Interpreting the score #

  • Perfect predictions (probability 1 for the class that actually occurs) yield 0.
  • Always predicting 0.5 results in 0.25, regardless of class balance, so a useful model should score well below this baseline.
  • The smaller the score, the better: the squared error punishes the model more heavily when its probabilities are far from the observed outcomes.
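These reference values are easy to verify. A minimal sketch with an illustrative label vector:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y = np.array([0, 1, 1, 0, 1])

# Perfect predictions: probability exactly matches each label
perfect = brier_score_loss(y, y.astype(float))

# Uninformative baseline: constant 0.5, independent of class balance
coin = brier_score_loss(y, np.full(len(y), 0.5))

print(perfect)  # 0.0
print(coin)     # 0.25
```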

4. Diagnose calibration with reliability diagrams #

The reliability diagram groups predictions into bins, plots the average predicted probability on the x-axis, and the empirical positive rate on the y-axis.

  • Points below the diagonal → the model is over-confident (predicted probabilities too high).
  • Points above the diagonal → the model is under-confident.
  • After applying calibration techniques (Platt scaling, isotonic regression, etc.), recompute the Brier Score and the diagram to confirm improvement.
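The before/after comparison in the last bullet can be sketched with scikit-learn's `CalibratedClassifierCV`, which wraps a model and fits Platt scaling (`method="sigmoid"`) or isotonic regression (`method="isotonic"`) via internal cross-validation. A minimal sketch reusing the breast-cancer setup from section 2 (whether calibration actually lowers the score depends on the model and data):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Uncalibrated baseline
base = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
base.fit(X_train, y_train)

# Isotonic regression fitted with 5-fold cross-validation on the training set
calibrated = CalibratedClassifierCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000)),
    method="isotonic",
    cv=5,
)
calibrated.fit(X_train, y_train)

for name, model in [("raw", base), ("isotonic", calibrated)]:
    p = model.predict_proba(X_test)[:, 1]
    print(f"{name:>8}: Brier = {brier_score_loss(y_test, p):.4f}")
```

Recomputing the reliability diagram on the calibrated probabilities (as in section 2) confirms visually whether the points moved closer to the diagonal.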

Summary #

  • The Brier Score measures the mean squared error of predicted probabilities; lower is better.
  • In Python 3.13, brier_score_loss together with a reliability diagram provides a quick calibration check.
  • Combine it with ROC-AUC and Precision/Recall metrics to evaluate both ranking ability and probability accuracy.