4.3.6 Balanced Accuracy
Summary
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
- Prerequisite: Confusion Matrix — understanding that concept first will make this one easier.
1. Definition #
Balanced Accuracy is the mean of the true-positive rate (TPR) and the true-negative rate (TNR):

$$
\mathrm{Balanced\ Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)
$$

For multiclass problems, Balanced Accuracy generalises to the unweighted mean of each class's recall.
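To make the formula concrete, here is a hand computation on made-up confusion-matrix counts (the TP/FN/TN/FP values below are illustrative, not from the article's model):

```python
# Hand-computed Balanced Accuracy for a binary confusion matrix.
# The counts are invented for illustration.
tp, fn = 30, 10   # positives: 30 caught, 10 missed
tn, fp = 90, 10   # negatives: 90 kept, 10 wrongly flagged

tpr = tp / (tp + fn)                 # recall of the positive class -> 0.75
tnr = tn / (tn + fp)                 # recall of the negative class -> 0.90
balanced_accuracy = (tpr + tnr) / 2  # -> 0.825

plain_accuracy = (tp + tn) / (tp + tn + fp + fn)  # -> ~0.857
print(f"Accuracy:          {plain_accuracy:.3f}")
print(f"Balanced Accuracy: {balanced_accuracy:.3f}")
```

Note how plain Accuracy sits above Balanced Accuracy here: the large negative class (100 of 140 samples) pulls the overall hit rate up, while Balanced Accuracy gives the weaker positive-class recall equal weight.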
2. Implementation in Python 3.13 #
We reuse the random-forest classifier from the Accuracy article and print both metrics side by side. The bar chart is saved at static/images/eval/classification/accuracy/accuracy_vs_balanced.png, so generate_eval_assets.py can regenerate it whenever you update the notebook.
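The original snippet is not shown here, so the following is a minimal sketch of that workflow. It assumes an imbalanced `make_classification` toy dataset standing in for the Accuracy article's setup; the dataset parameters and random seeds are illustrative choices, not the article's exact values:

```python
# Sketch: compare Accuracy vs Balanced Accuracy for a random forest on
# imbalanced data, then save the bar chart to the path named above.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Roughly 90/10 class imbalance (illustrative setup).
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)
print(f"Accuracy:          {acc:.3f}")
print(f"Balanced Accuracy: {bal_acc:.3f}")

out = Path("static/images/eval/classification/accuracy/accuracy_vs_balanced.png")
out.parent.mkdir(parents=True, exist_ok=True)
plt.bar(["Accuracy", "Balanced Accuracy"], [acc, bal_acc])
plt.ylim(0, 1)
plt.ylabel("Score")
plt.savefig(out)
```

On skewed data like this, expect Accuracy to come out higher than Balanced Accuracy whenever the model under-detects the minority class; the gap between the two bars is the point of the chart.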

Balanced Accuracy weights each class equally by averaging the recall per class.
3. When to prefer Balanced Accuracy #
- Strong class imbalance – plain Accuracy only reflects the majority class, while Balanced Accuracy keeps minority recall visible.
- Model comparison – when benchmark teams submit models trained on skewed data, Balanced Accuracy gives a more honest picture of their relative performance.
- Threshold tuning – combine it with precision/recall plots to see whether both classes remain detectable at your chosen threshold.
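The threshold-tuning point can be sketched with a quick sweep. The predicted probabilities below are synthetic stand-ins (a hypothetical 90/10 sample with random scores), not output from the article's model:

```python
# Sweep decision thresholds and score each cut-off with Balanced Accuracy.
import numpy as np
from sklearn.metrics import balanced_accuracy_score

y_true = np.array([0] * 18 + [1] * 2)  # 90/10 imbalance, illustrative

rng = np.random.default_rng(0)
# Toy scores: negatives cluster low, positives cluster higher.
y_score = np.concatenate([
    rng.uniform(0.0, 0.6, 18),
    rng.uniform(0.4, 1.0, 2),
])

for thr in (0.3, 0.5, 0.7):
    y_pred = (y_score >= thr).astype(int)
    bal = balanced_accuracy_score(y_true, y_pred)
    print(f"threshold={thr:.1f}  balanced accuracy={bal:.3f}")
```

Plotting this sweep next to precision/recall curves shows at a glance whether a threshold that looks fine overall is quietly sacrificing the minority class.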
4. Companion metrics #
| Metric | Measures | Caveat on imbalanced data |
|---|---|---|
| Accuracy | Overall hit rate | Dominated by the majority class |
| Recall / Sensitivity | Detection rate per class | Requires separate reporting for each class |
| Balanced Accuracy | Mean recall across classes | Highlights minority-class recall loss |
| Macro F1 | Harmonic mean of precision & recall (per class) | Useful when precision also matters |
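The table's metrics can all be produced side by side with scikit-learn. The label vectors below are a small hypothetical prediction set (8 negatives, 2 positives) chosen so the numbers are easy to verify by hand:

```python
# Companion metrics on one hypothetical prediction set.
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    recall_score,
)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]  # one FP, one FN

print("Accuracy:         ", accuracy_score(y_true, y_pred))           # 0.8
print("Recall per class: ", recall_score(y_true, y_pred, average=None))
print("Balanced Accuracy:", balanced_accuracy_score(y_true, y_pred))  # 0.6875
print("Macro F1:         ", f1_score(y_true, y_pred, average="macro"))
```

Per-class recall here is 0.875 (negatives) and 0.5 (positives), so Balanced Accuracy lands at their mean, 0.6875, well below the 0.8 Accuracy that the majority class props up.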
Summary #
- Balanced Accuracy is the average of per-class recall, making it well suited to imbalanced datasets.
- In Python 3.13, scikit-learn's balanced_accuracy_score gives you the value in one line; compare it with Accuracy to show stakeholders the difference.
- Combine it with precision, recall, and F1 metrics to decide how much weight to give each class when evaluating models.