4.3.10
Brier Score
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
1. Definition #
For binary classification the Brier Score is \(\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2\), where \(p_i\) is the predicted probability of the positive class and \(y_i\) is the actual label (0 or 1). For multiclass tasks, sum the squared errors across all classes for each sample, then average over samples.
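As a quick sanity check, the formula can be applied directly to a handful of hypothetical labels and probabilities and compared against scikit-learn's built-in implementation:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical toy labels and predicted probabilities
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3, 0.2])

# Direct application of the formula: mean squared error of the probabilities
manual = np.mean((p - y) ** 2)

# scikit-learn computes the same quantity
library = brier_score_loss(y, p)

print(manual, library)  # both 0.118
```

Note how the third-to-last sample (label 1, predicted 0.3) contributes \(0.49\) on its own, dominating the score: confident mistakes are punished quadratically.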
2. Implementation and visualisation in Python 3.13 #
The snippet below trains logistic regression on the breast-cancer dataset, prints the Brier Score, and plots a reliability diagram. The figure is saved to static/images/eval/classification/brier-score/reliability_curve.png, ready to be regenerated by generate_eval_assets.py.
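A sketch of such a snippet, assuming scikit-learn and matplotlib are available (the dataset, model, and output path follow the description above; hyperparameters such as `max_iter` and `n_bins` are illustrative choices):

```python
import os

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Train/test split on the breast-cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

print(f"Brier Score: {brier_score_loss(y_test, proba):.4f}")

# Reliability diagram: empirical positive rate vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(mean_pred, frac_pos, "o-", label="Logistic regression")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()

out = "static/images/eval/classification/brier-score/reliability_curve.png"
os.makedirs(os.path.dirname(out), exist_ok=True)
plt.savefig(out)
```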

Deviations from the 45° line reveal over- or under-confident probabilities.
3. Interpreting the score #
- Perfectly calibrated probabilities yield 0.
- Always predicting 0.5 on balanced data results in 0.25.
- The smaller the score, the better—the model is punished more heavily when its probabilities are far from the observed outcomes.
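The 0.25 baseline from the second bullet is easy to verify on synthetic balanced labels:

```python
import numpy as np

# A perfectly balanced label vector and a constant 0.5 prediction
y = np.array([0, 1] * 50)
p = np.full(len(y), 0.5)

# Every sample contributes (0.5 - 0)^2 or (0.5 - 1)^2, i.e. 0.25
print(np.mean((p - y) ** 2))  # 0.25
```

Any model worth deploying should therefore score well below 0.25 on balanced data.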
4. Diagnose calibration with reliability diagrams #
The reliability diagram groups predictions into bins, plots the average predicted probability on the x-axis, and the empirical positive rate on the y-axis.
- Points below the diagonal → the model is over-confident (predicted probabilities too high).
- Points above the diagonal → the model is under-confident.
- After applying calibration techniques (Platt scaling, isotonic regression, etc.), recompute the Brier Score and the diagram to confirm improvement.
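One way to sketch that recalibration step is scikit-learn's `CalibratedClassifierCV`, here with isotonic regression (the split, model, and `cv=5` are illustrative; on an already well-calibrated model such as logistic regression, the score may not improve much):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Uncalibrated baseline
raw = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
before = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])

# Isotonic recalibration via cross-validation on the training split
cal = CalibratedClassifierCV(
    LogisticRegression(max_iter=5000), method="isotonic", cv=5
)
cal.fit(X_tr, y_tr)
after = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])

print(f"Brier before calibration: {before:.4f}, after: {after:.4f}")
```

Rerunning the reliability diagram on `cal.predict_proba` confirms visually whether the points moved closer to the diagonal.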
Summary #
- The Brier Score measures the mean squared error of predicted probabilities; lower is better.
- In Python 3.13, scikit-learn's brier_score_loss together with a reliability diagram provides a quick calibration check.
- Combine it with ROC-AUC and Precision/Recall metrics to evaluate both ranking ability and probability accuracy.