Accuracy

Last updated 2020-01-29 Read time 3 min

Summary

Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
Combine charts and complementary metrics for effective model comparison and threshold tuning.

Prerequisite: Confusion Matrix. Understanding this concept first will make the rest of this page easier to follow.

1. Definition

Using the confusion-matrix entries (true positive TP, false positive FP, false negative FN, true negative TN), accuracy is defined as:

$$ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$

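It measures the overall hit rate, but by itself it does not show how the errors are distributed across classes: a model that always predicts the majority class can still score highly. Always pair it with other metrics when positive and negative samples appear at very different frequencies.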
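To make the imbalance caveat concrete, here is a minimal sketch that applies the formula above to hypothetical counts (10 positives, 990 negatives, and a model that predicts "negative" for every sample). The helper names `accuracy` and `balanced_accuracy` are illustrative and not part of any library; balanced accuracy is computed here as the mean of the recall on each class.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of the recall on the positive class and the recall on the negative class."""
    recall_pos = tp / (tp + fn)
    recall_neg = tn / (tn + fp)
    return (recall_pos + recall_neg) / 2


# Hypothetical counts: the model never predicts the positive class.
tp, fp, fn, tn = 0, 0, 10, 990

acc_demo = accuracy(tp, fp, fn, tn)
bal_acc_demo = balanced_accuracy(tp, fp, fn, tn)
print(f"Accuracy: {acc_demo:.3f}, Balanced Accuracy: {bal_acc_demo:.3f}")
# Accuracy: 0.990, Balanced Accuracy: 0.500
```

The 0.990 accuracy comes entirely from the 990 true negatives, while every positive sample is missed.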

2. Implementation and visualisation in Python 3.13

Confirm the interpreter and install the required packages:

python --version        # e.g. Python 3.13.0

pip install scikit-learn matplotlib
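The same check can also be run from inside Python, which is handy at the top of a notebook. The sketch below uses only the standard library; the helper name `installed_version` is an assumption made for illustration.

```python
import sys
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


info = sys.version_info
print(f"Python {info.major}.{info.minor}.{info.micro}")

# Distribution names as they are passed to pip.
for package in ("scikit-learn", "matplotlib"):
    print(f"{package}: {installed_version(package) or 'not installed'}")
```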

The script below trains a random forest on the breast-cancer dataset, computes Accuracy and Balanced Accuracy, and plots both as a bar chart. Wrapping StandardScaler and the classifier in a Pipeline means the scaler is fitted on the training split only and applied identically at prediction time. Images are saved under static/images/eval/... so that generate_eval_assets.py can refresh them automatically.

from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the data and hold out 25% for testing, preserving the class ratio.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scaling and the classifier are fitted together on the training split only.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=42, n_estimators=300),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Compute both metrics on the held-out test split.
acc = accuracy_score(y_test, y_pred)
bal_acc = balanced_accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.3f}, Balanced Accuracy: {bal_acc:.3f}")

# Plot the two scores side by side.
fig, ax = plt.subplots(figsize=(5, 4))
scores = np.array([acc, bal_acc])
labels = ["Accuracy", "Balanced Accuracy"]
colors = ["#2563eb", "#f97316"]
bars = ax.bar(labels, scores, color=colors)
ax.set_ylim(0, 1.05)

# Annotate each bar with its value.
for bar, score in zip(bars, scores):
    ax.text(
        bar.get_x() + bar.get_width() / 2,
        score + 0.02,
        f"{score:.3f}",
        ha="center",
        va="bottom",
    )

ax.set_ylabel("Score")
ax.set_title("Accuracy vs. Balanced Accuracy (Breast Cancer Dataset)")
ax.grid(axis="y", linestyle="--", alpha=0.4)
fig.tight_layout()

# Save the figure where the asset-refresh script expects it.
output_dir = Path("static/images/eval/classification/accuracy")
output_dir.mkdir(parents=True, exist_ok=True)
fig.savefig(output_dir / "accuracy_vs_balanced.png", dpi=150)
plt.close(fig)

Figure: Accuracy compared to Balanced Accuracy. Balanced Accuracy exposes errors that Accuracy hides when classes are imbalanced.

3. Handling class imbalance

Accuracy does not differentiate the cost of false negatives vs. false positives. On skewed datasets, supplement it with:

Precision / Recall / F1: to understand false alarms versus misses.
Balanced Accuracy: averages recall per class, making minority classes visible.
Confusion Matrix: shows which classes dominate the mistakes.
ROC / PR curves: inspect probability thresholds and trade-offs.

Because Balanced Accuracy is the mean of the per-class recalls, it is a good default when outcomes are skewed or when compliance requires a fairness-aware score.
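The supplementary metrics listed above can all be derived from the four confusion-matrix counts. The following sketch works through them in plain Python on hypothetical labels (4 positives, 16 negatives) so the arithmetic stays visible. In practice, `sklearn.metrics` provides `confusion_matrix`, `precision_score`, `recall_score`, `f1_score`, and `balanced_accuracy_score` for the same purpose.

```python
from collections import Counter

# Hypothetical labels for illustration: 1 = positive (minority class), 0 = negative.
y_true = [1] * 4 + [0] * 16
y_pred = [1, 1, 0, 0] + [1] + [0] * 15

# Tally (true, predicted) pairs to obtain the confusion-matrix entries.
counts = Counter(zip(y_true, y_pred))
tp, fn = counts[(1, 1)], counts[(1, 0)]
fp, tn = counts[(0, 1)], counts[(0, 0)]

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)      # share of positive predictions that were correct
recall = tp / (tp + fn)         # share of actual positives that were caught
specificity = tn / (tn + fp)    # recall of the negative class
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"Accuracy={accuracy:.3f} Precision={precision:.3f} Recall={recall:.3f}")
print(f"F1={f1:.3f} Balanced Accuracy={balanced_accuracy:.3f}")
```

On these labels accuracy is 0.850 even though half of the positives are missed, which recall (0.500) and Balanced Accuracy (0.719) make visible.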

4. Operational checklist

Align with business cost: check the confusion matrix and confirm that a "99% accuracy" claim does not mask critical misses.
Explore thresholding: analyse ROC or PR curves to see how accuracy and recall change when you adjust the decision threshold.
Report multiple metrics: include Precision, Recall, F1, and Balanced Accuracy in dashboards so stakeholders recognise trade-offs.
Keep reproducible notebooks: store the evaluation in a Python 3.13 notebook to re-run it quickly after model updates.
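The thresholding item in the checklist can be sketched as a simple sweep. The scores and labels below are hypothetical, and `metrics_at` is an illustrative helper rather than a library function; with a fitted scikit-learn classifier the scores would typically come from `predict_proba`.

```python
# Hypothetical predicted probabilities for the positive class, with true labels.
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
y_score = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.70, 0.90]


def metrics_at(threshold: float) -> tuple[float, float]:
    """Return (accuracy, recall) when scores >= threshold are predicted positive."""
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    positives = sum(y_true)
    return (tp + tn) / len(y_true), tp / positives


for threshold in (0.3, 0.5, 0.7):
    acc_t, rec_t = metrics_at(threshold)
    print(f"threshold={threshold:.1f}  accuracy={acc_t:.2f}  recall={rec_t:.2f}")
```

In this toy data, raising the threshold from 0.3 to 0.7 lifts accuracy from 0.60 to 0.90 while recall drops from 1.00 to 0.67, so the threshold with the best accuracy is not necessarily the one that matches the business cost of a miss.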

Summary

Accuracy is a convenient headline metric, but can mislead on imbalanced data.
A scikit-learn pipeline with scaling and a fixed random_state makes the calculation reproducible in Python 3.13.
Combine Accuracy with Balanced Accuracy and class-wise metrics to build a trustworthy evaluation narrative.