Summary
- Get a full picture of regression metrics and how to choose among them.
- Compare error metrics, determination metrics, and interval/probabilistic metrics with code examples.
- Summarise how to assemble a metric set that matches business needs and data characteristics.
Chapter 2 #
Regression metrics overview #
Regression tasks offer many ways to quantify “how close” predictions are to the truth. Each metric emphasises a different perspective (absolute error, squared error, relative error, probabilistic guarantees), so the right choice depends on the use case. This chapter maps out the representative metrics and highlights when to use each.
Metric categories #
Error-based metrics #
- Mean Absolute Error (MAE): robust to outliers and expressed in the same unit as the target.
- Root Mean Squared Error (RMSE): penalises large errors heavily; useful when big misses must be avoided.
- Mean Absolute Percentage Error (MAPE) / Weighted Absolute Percentage Error (WAPE): helpful when stakeholders think in percentages (e.g., demand forecasting).
- Mean Absolute Scaled Error (MASE): compares against a naïve seasonal baseline.
- Root Mean Squared Log Error (RMSLE): suitable when targets span wide ranges or growth rates matter, and more stable than MAPE when targets sit near zero (see the sketch after this list).
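The scaled and log-based variants are the ones most often implemented by hand, so here is a minimal NumPy sketch of WAPE, MASE, and RMSLE. The function names, the seasonal period m, and the use of the training series for the MASE denominator are illustrative choices, not fixed conventions.

import numpy as np

def wape(y, y_hat):
    return np.sum(np.abs(y - y_hat)) / np.sum(np.abs(y))   # total absolute error over total actuals

def mase(y, y_hat, y_train, m=1):
    # scale by the in-sample MAE of a seasonal naive forecast with period m
    naive_mae = np.mean(np.abs(y_train[m:] - y_train[:-m]))
    return np.mean(np.abs(y - y_hat)) / naive_mae

def rmsle(y, y_hat):
    # log1p keeps zero targets finite; negative targets or predictions are not allowed
    return np.sqrt(np.mean((np.log1p(y_hat) - np.log1p(y)) ** 2))

With m=1 the MASE baseline is a one-step naïve forecast; set m to the seasonal period (e.g., 7 for daily data with weekly seasonality) when comparing against a seasonal naïve model.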
Determination / variance explanation #
- Coefficient of determination (R²): baseline measure of variance explained; can be negative.
- Adjusted R²: penalises features that do not improve the fit, keeping models with different feature counts comparable.
- Explained variance: like R² but based on the residual variance, so it ignores constant bias in the predictions; the sketch after this list contrasts the three.
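As a rough illustration of how the three differ, the sketch below scores deliberately biased predictions with scikit-learn's r2_score and explained_variance_score plus a hand-rolled Adjusted R²; the synthetic data and the feature count p are placeholders.

import numpy as np
from sklearn.metrics import r2_score, explained_variance_score

def adjusted_r2(y, y_hat, p):
    # 1 - (1 - R^2) * (n - 1) / (n - p - 1), where p is the number of features
    n = len(y)
    return 1 - (1 - r2_score(y, y_hat)) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(0)
y = rng.normal(loc=50, scale=10, size=200)
y_hat = y + rng.normal(scale=3, size=200) + 5    # noisy predictions with a constant +5 bias

print(f"R^2:                {r2_score(y, y_hat):.3f}")
print(f"Explained variance: {explained_variance_score(y, y_hat):.3f}")  # barely affected by the offset
print(f"Adjusted R^2 (p=8): {adjusted_r2(y, y_hat, p=8):.3f}")

Explained variance stays high despite the constant offset, which is exactly why it should be read alongside R² or a residual-mean check.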
Interval and probabilistic metrics #
- Pinball loss: evaluates quantile regression / prediction intervals.
- PICP (Prediction Interval Coverage Probability): measures how often prediction intervals capture the true value.
- PINAW (Prediction Interval Normalised Average Width): evaluates how tight the intervals are (a sketch of all three metrics follows this list).
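A minimal sketch of the three, assuming a central 90% interval built from 5% and 95% quantile forecasts; the synthetic point forecasts and the ±1.64σ interval width are placeholder assumptions, and scikit-learn's mean_pinball_loss covers the quantile part.

import numpy as np
from sklearn.metrics import mean_pinball_loss

def picp(y, q_lo, q_hi):
    return np.mean((y >= q_lo) & (y <= q_hi))                # fraction of true values inside the interval

def pinaw(y, q_lo, q_hi):
    return np.mean(q_hi - q_lo) / (np.max(y) - np.min(y))    # average width, normalised by the target range

rng = np.random.default_rng(1)
y = rng.normal(loc=100, scale=15, size=500)
point = y + rng.normal(scale=10, size=500)                   # placeholder point forecasts with sigma = 10
q_lo, q_hi = point - 16.4, point + 16.4                      # +/- 1.64 sigma, i.e. a nominal 90% interval

print("Pinball loss @ 0.05:", mean_pinball_loss(y, q_lo, alpha=0.05))
print("Pinball loss @ 0.95:", mean_pinball_loss(y, q_hi, alpha=0.95))
print("PICP :", picp(y, q_lo, q_hi))                         # close to 0.90 when the interval is calibrated
print("PINAW:", pinaw(y, q_lo, q_hi))

Read PICP and PINAW together: widening the interval always raises coverage, so the pair, not either number alone, tells you whether the intervals are useful.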
Workflow for choosing metrics #
- Define business objectives: decide whether absolute error, percentage error, or asymmetric penalties matter more.
- Assess data characteristics: check for values crossing zero, heavy tails/outliers, or strong seasonality.
- Establish baselines: compare against naïve forecasts, simple regressions, or the mean to contextualise improvements.
- Evaluate with complementary metrics: pair MAE with RMSE, or R² with Adjusted R², to cover different viewpoints.
Example: comparing metrics side by side #
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
n = 200

y_true = rng.normal(loc=100, scale=15, size=n)             # synthetic ground truth
noise = rng.normal(scale=10, size=n)
baseline = y_true.mean() + rng.normal(scale=12, size=n)    # mean-only baseline predictions
model = y_true + noise                                     # noisy model predictions
robust_model = y_true + np.clip(noise, -8, 8)              # same model with errors clipped to +/- 8

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                      # mean absolute error

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))              # root mean squared error

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y)) * 100          # mean absolute percentage error

scores = {
    "MAE": [mae(y_true, baseline), mae(y_true, model), mae(y_true, robust_model)],
    "RMSE": [rmse(y_true, baseline), rmse(y_true, model), rmse(y_true, robust_model)],
    "MAPE (%)": [mape(y_true, baseline), mape(y_true, model), mape(y_true, robust_model)],
}

labels = ["Baseline", "Model", "Robust"]
x = np.arange(len(labels))
width = 0.25

fig, ax = plt.subplots(figsize=(7, 4.5))
for idx, (metric, values) in enumerate(scores.items()):
    ax.bar(x + idx * width, values, width=width, label=metric)   # grouped bars, one group per model
ax.set_xticks(x + width)
ax.set_xticklabels(labels)
ax.set_ylabel("Score")
ax.set_title("Comparison of regression metrics across models")
ax.legend()
ax.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

Different metrics weight the same errors differently. All three favour the robust model here, but the gap between it and the unclipped model widens under RMSE because squared errors amplify the occasional large miss, while the mean-only baseline scores worst on every metric. Use complementary metrics to make sure the chosen model matches the intended behaviour.
Metric quick reference #
| Category | Metric | Primary use | Caveats |
|---|---|---|---|
| Error | MAE | Outlier-resistant, interpretable units | Not suited when relative error is critical |
| Error | RMSE | Emphasises large errors | Sensitive to outliers |
| Error | RMSLE | Handles wide ranges / growth rates | Requires non-negative targets and predictions |
| Error | MAPE / WAPE | Percentage-based reporting | Breaks when targets approach zero; penalises over-forecasts more heavily, biasing models toward under-forecasting |
| Error | MASE | Compare against naïve seasonal forecasts | Requires proper seasonal period |
| Interval | Pinball loss | Quantile / interval evaluation | Needs per-quantile optimisation |
| Determination | R² | Variance explanation | Can be negative |
| Determination | Adjusted R² | Comparing models with different feature counts | Unstable with tiny sample sizes |
| Determination | Explained variance | Variance explained, ignoring constant bias | Masks systematic over- or under-prediction; check bias separately |
| Interval | PICP | Coverage of prediction intervals | Inspect interval width simultaneously |
| Interval | PINAW | Tightness of intervals | Interpret alongside coverage |
Operational checklist #
- Before deployment: verify metrics on the full history and inspect residual plots to catch anomalies.
- Monitoring: track metric drift (e.g., rolling MAE, RMSE) and define alert thresholds (see the sketch after this list).
- Visual diagnostics: combine prediction vs. actual charts, residual histograms, and quantile plots.
- Stakeholder communication: translate metrics into business terms (e.g., “average error of ±¥X”) to drive decisions.
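As a sketch of the monitoring item above, the snippet below tracks a 14-day rolling MAE over a pandas time series and flags windows that exceed a threshold; the window length, the threshold, and the simulated drift are placeholders to tune per use case.

import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=120, freq="D")    # placeholder daily production log
rng = np.random.default_rng(7)
actual = pd.Series(rng.normal(100, 15, size=120), index=idx)
predicted = actual + rng.normal(0, 8, size=120)
predicted.iloc[90:] += 12                                    # simulate drift over the last month

abs_error = (actual - predicted).abs()
rolling_mae = abs_error.rolling(window=14).mean()            # 14-day rolling MAE

threshold = 10.0                                             # placeholder alert threshold
alerts = rolling_mae[rolling_mae > threshold]
print(alerts.head())                                         # first dates where the rolling MAE breaches the threshold

The same pattern applies to RMSE or WAPE; the key is to alert on sustained breaches rather than single noisy days.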