Summary
- Organise the key validation strategies and information criteria used when comparing models.
- Take a birds-eye view of cross-validation, validation/learning curves, information criteria, and hyperparameter search.
- Summarise how to build an evaluation workflow that balances data constraints, cost, and decision-making.
Chapter 1 #
Model Selection at a Glance #
Choosing a model is about more than picking the best score on a hold-out split. You must decide how to estimate generalisation performance, which metrics to trust, and how to keep computation under control. This chapter brings together the core validation techniques shared across regression and classification tasks and offers guidance on when to reach for each tool.
Core Techniques #
1. Data splits and cross-validation #
- K-Fold / Stratified K-Fold (see Cross-validation / Stratified K-Fold): the default choice when data are limited. Stratification preserves label balance for classification.
- Nested cross-validation (see Nested CV): best practice when you need an unbiased estimate that includes hyperparameter search; a minimal sketch follows this list.
- Time-series splits: required when temporal order matters—combine expanding or sliding windows with domain knowledge.
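As a concrete illustration of the nested setup, the sketch below wraps a GridSearchCV inside an outer StratifiedKFold on the same breast-cancer data used later in this chapter; the parameter grid and fold counts are illustrative choices, not recommendations.
# Nested CV sketch: the inner search tunes C, the outer loop scores the
# tuned pipeline, so the reported AUC is not biased by the tuning itself.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
search = GridSearchCV(
    pipe,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="roc_auc",
    cv=inner_cv,
)

# The outer loop evaluates the whole "tune, then fit" procedure
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(f"Nested CV ROC-AUC: {nested_scores.mean():.3f} ± {nested_scores.std():.3f}")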
2. Curves for visual diagnosis #
- Validation curves (Validation Curve): reveal how a hyperparameter affects training/validation scores.
- Learning curves (Learning Curve): show how performance scales with sample size, clarifying the value of collecting more data; both curves are sketched in code after this list.
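The helpers below compute both curves for a regularised logistic regression; the data set, parameter range, and fold count are assumptions made for the sake of a runnable sketch, and plotting is left out for brevity.
# Sketch: compute validation- and learning-curve scores (plotting omitted)
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

# Validation curve: sweep the regularisation strength C
param_range = np.logspace(-3, 2, 6)
train_scores, val_scores = validation_curve(
    pipe, X, y,
    param_name="logisticregression__C",
    param_range=param_range,
    cv=5,
    scoring="roc_auc",
)

# Learning curve: score the same pipeline at increasing training-set sizes
sizes, lc_train, lc_val = learning_curve(
    pipe, X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="roc_auc"
)

print("Validation curve (mean val AUC per C):", val_scores.mean(axis=1).round(3))
print("Learning curve (mean val AUC per size):", lc_val.mean(axis=1).round(3))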
3. Information criteria #
- AIC / BIC (AIC & BIC): penalise model complexity in Gaussian or generalised linear models (a worked example follows this list).
- Mallows' (C_p) and related statistics: useful when a closed-form estimator of prediction error is available.
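A worked example of the idea, independent of any particular library: for a Gaussian linear model the criteria reduce to AIC = n·ln(RSS/n) + 2k and BIC = n·ln(RSS/n) + k·ln(n), up to constants shared by all candidates, so the values can be compared directly. The synthetic data and polynomial candidates below are assumptions for illustration only.
# Compare polynomial degrees on synthetic data with AIC/BIC.
# Constants shared by all candidates (including the sigma^2 parameter,
# which every model estimates) are dropped, since only differences matter.
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-3, 3, n)
y = 1.0 + 0.5 * x - 0.7 * x**2 + rng.normal(scale=1.0, size=n)  # true model: quadratic

for degree in range(1, 6):
    coeffs = np.polyfit(x, y, deg=degree)
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    k = degree + 1  # number of fitted coefficients
    aic = n * np.log(rss / n) + 2 * k
    bic = n * np.log(rss / n) + k * np.log(n)
    print(f"degree={degree}  AIC={aic:8.1f}  BIC={bic:8.1f}")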
4. Hyperparameter search #
- Grid / random search: exhaustive over small spaces, stochastic sampling for quick wins in larger spaces (see the random-search sketch after this list).
- Bayesian optimisation / Hyperband: data-efficient choices when evaluations are expensive.
- AutoML pipelines: automate model, feature, and hyperparameter search when you need an end-to-end baseline.
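The sketch below runs a plain random search over a small random-forest space with scikit-learn's RandomizedSearchCV; the sampled ranges and the budget of 20 candidates are illustrative assumptions.
# Random search sketch: 20 sampled configurations, each scored by 5-fold CV
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

param_distributions = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 12),
    "min_samples_leaf": randint(1, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=42,
    n_jobs=-1,
)
search.fit(X, y)
print("Best ROC-AUC:", round(search.best_score_, 3))
print("Best params:", search.best_params_)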
Cross-validation comparison #
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Candidate models: two regularised logistic regressions and a random forest
models = {
    "LogReg (L2)": make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=2000, penalty="l2", C=1.0, solver="lbfgs"),
    ),
    "LogReg (ElasticNet)": make_pipeline(
        StandardScaler(),
        LogisticRegression(
            max_iter=2000,
            penalty="elasticnet",
            solver="saga",
            C=1.0,
            l1_ratio=0.4,
        ),
    ),
    "RandomForest": RandomForestClassifier(
        n_estimators=200, max_depth=6, random_state=42
    ),
}

# Score each model with stratified 5-fold CV on ROC-AUC
means = []
stds = []
labels = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    means.append(scores.mean())
    stds.append(scores.std())
    labels.append(name)

# Horizontal bars: mean ROC-AUC with the fold-to-fold standard deviation
y_pos = np.arange(len(labels))
fig, ax = plt.subplots(figsize=(6.5, 3.8))
ax.barh(y_pos, means, xerr=stds, color="#2563eb", alpha=0.8)
ax.set_yticks(y_pos)
ax.set_yticklabels(labels)
ax.set_xlabel("ROC-AUC (5-fold mean ± std)")
ax.set_xlim(0.9, 1.0)
ax.set_title("Cross-validation comparison across models")
ax.grid(axis="x", alpha=0.3)
plt.tight_layout()
plt.show()

Cross-validation exposes both the mean performance and the variance. Here, elastic-net regularisation edges out the baseline while keeping variance in check.
Building an evaluation workflow #
- Understand the data distribution: handle label imbalance, temporal correlations, and leakage before choosing splits.
- Make the comparison space explicit: list the models, pre-processing steps, and feature sets you plan to evaluate.
- Align on metrics and thresholds: agree with stakeholders on which metrics drive the decision (ROC-AUC, RMSE, cost, etc.).
- Ensure reproducibility: track random seeds, split definitions, and environment versions. Automate whenever possible (a logging sketch follows this list).
- Budget computation: estimate runtime for grid/random search, then refine with adaptive or Bayesian methods once promising regions emerge.
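One lightweight way to make the reproducibility step concrete is to keep the split definition, seeds, and scoring in an explicit configuration and store it next to the per-fold scores. The config keys and the file name below are hypothetical, chosen only to illustrate the pattern.
# Sketch: log the evaluation configuration together with the per-fold scores
import json

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

config = {
    "split": {"n_splits": 5, "shuffle": True, "random_state": 42},  # hypothetical layout
    "scoring": ["roc_auc", "accuracy"],
    "model": "StandardScaler + LogisticRegression(max_iter=2000)",
}

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(**config["split"])
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))

results = cross_validate(model, X, y, cv=cv, scoring=config["scoring"])
log = {
    "config": config,
    "scores": {k: v.tolist() for k, v in results.items() if k.startswith("test_")},
}

# The file name is illustrative; in practice this would go to your experiment tracker
with open("experiment_log.json", "w") as f:
    json.dump(log, f, indent=2)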
Quick reference #
| Topic | Related pages | Notes |
|---|---|---|
| Cross-validation basics | Cross-validation / Stratified K-Fold | Overview of splitting strategies |
| Nested validation | Nested CV | Unbiased estimates with hyperparameter tuning |
| Curve-based diagnostics | Learning Curve / Validation Curve | Visualise data sufficiency and hyperparameter effects |
| Information criteria | AIC & BIC | Compare parametric models with complexity penalties |
Checklist #
- Documented how the data are split (stratified, time-series, etc.)
- Chosen the primary metrics and reporting format
- Defined the hyperparameter search space and stages
- Shared reproducible code/configuration for every experiment
- Considered constraints such as inference time and model size