4.3.8
F1 Score
Summary
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
- Prerequisite: Precision & Recall — understanding that concept first will make this one easier to learn
1. Definition #
With precision \(P\) and recall \(R\), F1 is defined as \(F_1 = 2 \cdot \frac{P \cdot R}{P + R}\). The general \(F_\beta\) score gives more weight to recall (\(\beta > 1\)) or precision (\(\beta < 1\)): \(F_\beta = (1 + \beta^2) \cdot \frac{P \cdot R}{\beta^2 P + R}\).
2. Computing F1 in Python 3.13 #
classification_report displays precision, recall, and F1 per class in one table.
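As a minimal sketch, assuming scikit-learn is installed (the label arrays below are made-up illustration data):

```python
# Hypothetical example data: 10 true labels and 10 predicted labels.
from sklearn.metrics import f1_score, classification_report

y_true = [0, 1, 1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0, 1, 0]

# F1 for the positive class; here precision = recall = 0.8, so F1 = 0.8.
f1 = f1_score(y_true, y_pred)
print(f1)

# Per-class precision, recall, and F1 in one table.
print(classification_report(y_true, y_pred))
```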
3. How F1 varies with the threshold #
Using probability outputs we can plot how F1 evolves as the decision threshold changes.

Use the curve to locate the threshold that maximises F1, or to trade precision for recall as requirements change.
- The peak indicates the best trade-off between precision and recall when both are equally important.
- Use F0.5 or F2 when you want to bias the trade-off toward precision or recall respectively.
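The sweep described above can be sketched as follows, using synthetic probabilities in place of a real model's probability outputs:

```python
# Sketch: evaluate F1 at each candidate threshold and pick the maximiser.
# The y_prob values are made-up scores standing in for predict_proba output.
import numpy as np
from sklearn.metrics import f1_score

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.12, 0.41, 0.34, 0.82, 0.63, 0.22, 0.91, 0.52, 0.71, 0.44])

thresholds = np.linspace(0.05, 0.95, 19)
scores = [f1_score(y_true, (y_prob >= t).astype(int)) for t in thresholds]

best = thresholds[int(np.argmax(scores))]
print(f"best threshold = {best:.2f}, F1 = {max(scores):.3f}")

# To visualise the curve, plot thresholds against scores, e.g. with matplotlib:
# import matplotlib.pyplot as plt
# plt.plot(thresholds, scores); plt.xlabel("threshold"); plt.ylabel("F1"); plt.show()
```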
4. Averaging strategies for multiclass #
scikit-learn’s average parameter lets you aggregate F1 for multiclass or multilabel data:
- macro — compute F1 per class and take the (unweighted) mean.
- weighted — average per-class F1 weighted by class support.
- micro — pool all predictions and recompute from the global confusion matrix.
For multilabel problems, average="samples" reports the mean F1 per sample.
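A short sketch comparing the strategies on made-up labels, with a multilabel example for average="samples":

```python
# Hypothetical 3-class example to contrast the averaging strategies.
import numpy as np
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0, 2, 2]
y_pred = [0, 1, 2, 1, 1, 2, 2, 2]

scores = {avg: f1_score(y_true, y_pred, average=avg)
          for avg in ("macro", "weighted", "micro")}
for avg, value in scores.items():
    print(f"{avg}: {value:.3f}")

# Multilabel case: binary indicator matrices, averaged per sample.
Y_true = np.array([[1, 0, 1], [0, 1, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0]])
samples_f1 = f1_score(Y_true, Y_pred, average="samples")
print(f"samples: {samples_f1:.3f}")
```

Note that micro-averaged F1 equals accuracy in single-label multiclass settings, since every misclassification contributes one false positive and one false negative to the pooled counts.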
Summary #
- F1 balances precision and recall; plotting it across thresholds helps you choose the operating point.
- Fβ scores adapt the balance when either recall (β>1) or precision (β<1) must dominate.
- On multiclass tasks, specify the averaging strategy and review precision, recall, F1, and PR curves together to understand the classifier’s behaviour.