Confusion Matrix

Last updated 2020-01-15 Read time 3 min

Summary

Understand what a confusion matrix is, what it evaluates, and how to interpret its four cells.
Compute and visualise the matrix with Python 3.13 code examples, covering key steps and practical checkpoints.
Combine the matrix with charts and complementary metrics for effective model comparison and threshold tuning.

1. Anatomy of a confusion matrix

For binary classification the matrix is a 2×2 table:

	Predicted: Negative	Predicted: Positive
Actual: Negative	True Negative (TN)	False Positive (FP)
Actual: Positive	False Negative (FN)	True Positive (TP)

Rows represent the ground truth, columns the model prediction.
Inspecting TP / FP / FN / TN reveals whether the model is biased toward a specific class.
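To make the four cells concrete, here is a minimal sketch that counts them by hand in plain Python. The helper name `confusion_counts` and the label lists are made up for illustration; they are not part of any library.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN, TP for binary labels (hypothetical helper)."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1  # actual positive, predicted positive
            else:
                fn += 1  # actual positive, predicted negative
        else:
            if predicted == positive:
                fp += 1  # actual negative, predicted positive
            else:
                tn += 1  # actual negative, predicted negative
    return tn, fp, fn, tp


# Made-up labels for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)

# Same layout as the table: rows = actual, columns = predicted.
matrix = [[tn, fp],
          [fn, tp]]
print(matrix)
```

scikit-learn's confusion_matrix uses the same orientation with labels in sorted order, so for 0/1 labels calling .ravel() on its result yields the counts in the order tn, fp, fn, tp.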

2. End-to-end example on Python 3.13

Make sure you are running Python 3.13 and install the required libraries:

python --version  # e.g. Python 3.13.0
pip install scikit-learn matplotlib

The script below trains a logistic regression model on the breast cancer dataset, then prints and plots the confusion matrix. A Pipeline with StandardScaler keeps the optimisation stable and avoids convergence warnings.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load the data; in this dataset label 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)

# Hold out 20% for testing, keeping the class ratio the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale the features, then fit logistic regression.
pipeline = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, solver="lbfgs"),
)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Rows are actual labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Plot the counts as a heatmap.
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(cmap="Blues", colorbar=False)
plt.tight_layout()
plt.show()

Figure: Confusion matrix for the breast cancer dataset, rendered with scikit-learn (Python 3.13).

3. Normalising the matrix

When the dataset is imbalanced, normalising by row (actual labels) helps you compare error rates.

cm_norm = confusion_matrix(y_test, y_pred, normalize="true")
print(cm_norm)
disp_norm = ConfusionMatrixDisplay(confusion_matrix=cm_norm)
disp_norm.plot(cmap="Blues", values_format=".2f", colorbar=False)
plt.tight_layout()
plt.show()

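normalize="true": each row sums to 1 (proportions within each actual class; the diagonal is per-class recall)
normalize="pred": each column sums to 1 (proportions within each predicted class; the diagonal is per-class precision)
normalize="all": all cells together sum to 1 (proportions of all observations)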
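The three options differ only in the total each cell is divided by. The sketch below reproduces that arithmetic in plain Python on made-up counts; the function names are hypothetical, and mapping a zero total to 0.0 is an assumption of this sketch.

```python
def normalize_rows(cm):
    """Divide each cell by its row total (the idea behind normalize="true")."""
    out = []
    for row in cm:
        total = sum(row)
        out.append([cell / total if total else 0.0 for cell in row])
    return out


def normalize_cols(cm):
    """Divide each cell by its column total (the idea behind normalize="pred")."""
    col_totals = [sum(col) for col in zip(*cm)]
    return [
        [cell / t if t else 0.0 for cell, t in zip(row, col_totals)]
        for row in cm
    ]


def normalize_all(cm):
    """Divide each cell by the grand total (the idea behind normalize="all")."""
    grand = sum(sum(row) for row in cm)
    return [[cell / grand if grand else 0.0 for cell in row] for row in cm]


# Made-up imbalanced counts: 90 actual negatives, 10 actual positives.
cm = [[85, 5],
      [4, 6]]

by_true = normalize_rows(cm)
by_pred = normalize_cols(cm)
by_all = normalize_all(cm)

for name, m in [("true", by_true), ("pred", by_pred), ("all", by_all)]:
    print(name, [[round(v, 3) for v in row] for row in m])
```

On these counts 91 of 100 predictions are correct, yet the row-normalised matrix shows that 40% of the actual positives are missed, which the raw counts make easy to overlook.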

4. Extending to multiclass problems

ConfusionMatrixDisplay.from_predictions computes the matrix directly from the true and predicted labels and plots it in one call. It works unchanged for multiclass tasks and labels the axes with the class values.

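import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# ground_truth_labels and model_outputs are placeholders for your own
# arrays of true and predicted class labels.
ConfusionMatrixDisplay.from_predictions(
    y_true=ground_truth_labels,
    y_pred=model_outputs,
    normalize="true",
    values_format=".2f",
    cmap="Blues",
)
plt.tight_layout()
plt.show()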
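To see what a multiclass matrix contains before plotting it, the following sketch builds one by hand in plain Python. The class names, labels, and the helper `confusion_matrix_multiclass` are made up for illustration and are not scikit-learn API.

```python
def confusion_matrix_multiclass(y_true, y_pred, labels):
    """Build a K x K count matrix: rows = actual, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    cm = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        cm[index[actual]][index[predicted]] += 1
    return cm


# Made-up three-class labels for illustration only.
labels = ["cat", "dog", "fox"]
y_true = ["cat", "cat", "cat", "dog", "dog", "dog", "fox", "fox", "fox", "fox"]
y_pred = ["cat", "cat", "dog", "dog", "dog", "fox", "fox", "dog", "dog", "fox"]

cm = confusion_matrix_multiclass(y_true, y_pred, labels)

# Per-class recall is the diagonal cell divided by its row total.
recall = {label: cm[i][i] / sum(cm[i]) for i, label in enumerate(labels)}

# The largest off-diagonal cell is the most frequent confusion.
off_diagonal = [
    (cm[i][j], labels[i], labels[j])
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j
]
count, actual, predicted = max(off_diagonal)

for label, row in zip(labels, cm):
    print(label, row)
print(f"most frequent confusion: actual {actual} predicted as {predicted} ({count}x)")
```

Reading the matrix row by row like this is often the quickest way to spot which pairs of classes the model mixes up.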

5. Practical checkpoints

False negatives vs. false positives: decide which error is more costly for your use case (e.g., a missed case in medical diagnosis vs. a wrongly flagged transaction in fraud detection) and monitor the relevant cells closely.
Pair with heatmaps: visual inspection highlights skewed classes and makes cross-team discussions easier.
Derive other metrics: accuracy, precision, recall, and F1 can all be computed from the same matrix. Compare them with ROC-AUC or PR curves for a fuller picture.
Keep notebooks reproducible: packaging the analysis in a Python 3.13 notebook enables fast iteration when you tune or retrain the model.
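The derived metrics mentioned above follow directly from the four counts. Here is a minimal sketch in plain Python using made-up counts; the function names and the cost weights are hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def derived_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall, and F1 from the four cells."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def error_cost(fp, fn, cost_fp=1.0, cost_fn=5.0):
    """Weighted error cost; the default weights are illustrative only."""
    return fp * cost_fp + fn * cost_fn


# Made-up counts for illustration only.
metrics = derived_metrics(tn=85, fp=5, fn=4, tp=6)
cost = error_cost(fp=5, fn=4)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"weighted error cost: {cost}")
```

Accuracy looks high on these counts while precision and recall are much lower, which is why the matrix and its derived metrics are best read together. The cost function shows one simple way to encode which error matters more.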

Summary

A confusion matrix summarises TP / FP / FN / TN and exposes the bias of a classifier.
Normalising the matrix reveals error ratios when classes are imbalanced.
Combine the matrix with derived metrics and business requirements to define actionable evaluation criteria.