4.3.10
Brier Score
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
1. Definition #
For binary classification the Brier Score is \(\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)^2\), where \(p_i\) is the predicted probability of the positive class and \(y_i\) is the actual label (0 or 1). For multiclass tasks, sum the squared errors across all classes for each sample, then average over samples.
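As a quick sanity check, the formula can be applied directly to a handful of hypothetical labels and probabilities and compared against scikit-learn's built-in implementation:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

# Hypothetical toy labels and predicted probabilities
y = np.array([1, 0, 1, 1, 0])
p = np.array([0.9, 0.1, 0.8, 0.3, 0.2])

# Direct application of the formula: mean squared error of the probabilities
manual = np.mean((p - y) ** 2)

# scikit-learn computes the same quantity
library = brier_score_loss(y, p)

print(manual, library)  # both 0.118
```

Note how the third-to-last sample (label 1, predicted 0.3) contributes \(0.49\) on its own, dominating the score: confident mistakes are punished quadratically.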
2. Implementation and visualisation in Python 3.13 #
The snippet below trains logistic regression on the breast-cancer dataset, prints the Brier Score, and plots a reliability diagram. The figure is saved to static/images/eval/classification/brier-score/reliability_curve.png, ready to be regenerated by generate_eval_assets.py.
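A sketch of such a snippet, assuming scikit-learn and matplotlib are available (the dataset, model, and output path follow the description above; hyperparameters such as `max_iter` and `n_bins` are illustrative choices):

```python
import os

import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Train/test split on the breast-cancer dataset
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

print(f"Brier Score: {brier_score_loss(y_test, proba):.4f}")

# Reliability diagram: empirical positive rate vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_test, proba, n_bins=10)
plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(mean_pred, frac_pos, "o-", label="Logistic regression")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of positives")
plt.legend()

out = "static/images/eval/classification/brier-score/reliability_curve.png"
os.makedirs(os.path.dirname(out), exist_ok=True)
plt.savefig(out)
```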

Deviations from the 45° line reveal over- or under-confident probabilities.
3. Interpreting the score #
- Perfectly calibrated probabilities yield 0.
- Always predicting 0.5 on balanced data results in 0.25.
- The smaller the score, the better—the model is punished more heavily when its probabilities are far from the observed outcomes.
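The 0.25 baseline from the second bullet is easy to verify on synthetic balanced labels:

```python
import numpy as np

# A perfectly balanced label vector and a constant 0.5 prediction
y = np.array([0, 1] * 50)
p = np.full(len(y), 0.5)

# Every sample contributes (0.5 - 0)^2 or (0.5 - 1)^2, i.e. 0.25
print(np.mean((p - y) ** 2))  # 0.25
```

Any model worth deploying should therefore score well below 0.25 on balanced data.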
4. Diagnose calibration with reliability diagrams #
The reliability diagram groups predictions into bins, plots the average predicted probability on the x-axis, and the empirical positive rate on the y-axis.
- Points below the diagonal → the model is over-confident (predicted probabilities too high).
- Points above the diagonal → the model is under-confident.
- After applying calibration techniques (Platt scaling, isotonic regression, etc.), recompute the Brier Score and the diagram to confirm improvement.
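One way to sketch that recalibration step is scikit-learn's `CalibratedClassifierCV`, here with isotonic regression (the split, model, and `cv=5` are illustrative; on an already well-calibrated model such as logistic regression, the score may not improve much):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Uncalibrated baseline
raw = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
before = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])

# Isotonic recalibration via cross-validation on the training split
cal = CalibratedClassifierCV(
    LogisticRegression(max_iter=5000), method="isotonic", cv=5
)
cal.fit(X_tr, y_tr)
after = brier_score_loss(y_te, cal.predict_proba(X_te)[:, 1])

print(f"Brier before calibration: {before:.4f}, after: {after:.4f}")
```

Rerunning the reliability diagram on `cal.predict_proba` confirms visually whether the points moved closer to the diagonal.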
Summary #
- The Brier Score measures the mean squared error of predicted probabilities; lower is better.
- In Python 3.13, scikit-learn's brier_score_loss together with a reliability diagram provides a quick calibration check.
- Combine it with ROC-AUC and Precision/Recall metrics to evaluate both ranking ability and probability accuracy.