Key takeaways
- Nested cross-validation separates hyperparameter search from outer validation to avoid optimistic bias.
- Build outer/inner loops in scikit-learn and inspect how leakage is prevented in code.
- Understand when the extra compute is worth it and how to communicate the results.
1. How it works #
- Outer loop — split the dataset into K_outer folds; each fold acts as an untouched test set.
- Inner loop — run K_inner-fold cross-validation on the remaining data to tune hyperparameters (grid search, random search, Bayesian optimisation, …).
- Evaluation — retrain using the best hyperparameters and score on the held-out outer fold.
Repeat this procedure K_outer times and aggregate the outer scores (mean ± standard deviation or confidence intervals). The sketch below spells out the same two loops by hand.
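In this sketch the synthetic dataset, the tiny candidate grid, and the fold counts are placeholders chosen only to keep the example small and runnable:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score

# Placeholder data; any binary-classification X, y works here.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A deliberately tiny candidate grid to keep the example fast.
candidates = list(ParameterGrid({"max_depth": [None, 5], "n_estimators": [100]}))

outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, y_train = X[train_idx], y[train_idx]

    # Inner loop: rank candidates using only the outer-training data.
    best = max(
        candidates,
        key=lambda params: cross_val_score(
            RandomForestClassifier(random_state=0, **params),
            X_train, y_train, cv=inner_cv, scoring="roc_auc",
        ).mean(),
    )

    # Retrain with the winning hyperparameters and score on the untouched outer fold.
    model = RandomForestClassifier(random_state=0, **best).fit(X_train, y_train)
    pred = model.predict_proba(X[test_idx])[:, 1]
    outer_scores.append(roc_auc_score(y[test_idx], pred))

print("Nested CV ROC-AUC:", np.mean(outer_scores), "+/-", np.std(outer_scores))
In practice you rarely write these loops yourself; the scikit-learn shortcut in the next section does the same thing in far fewer lines.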
2. Implementation in Python #
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Placeholder data so the snippet runs end to end; substitute your own X, y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hyperparameter grid explored by the inner loop.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
}

# Inner CV tunes hyperparameters; outer CV measures generalisation.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=inner_cv,
    scoring="roc_auc",
    n_jobs=-1,
)

# cross_val_score refits the entire search on each outer training split,
# so the outer test fold never influences hyperparameter selection.
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc", n_jobs=-1)
print("Nested CV ROC-AUC:", scores.mean(), "+/-", scores.std())
Passing the search object (GridSearchCV, RandomizedSearchCV, etc.) into cross_val_score is all it takes to run nested cross-validation.
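If you also want to know which hyperparameters each outer fold chose, swapping cross_val_score for cross_validate with return_estimator=True exposes the fitted search objects; a small sketch reusing grid, outer_cv, X, and y from above:
from sklearn.model_selection import cross_validate

# return_estimator=True keeps the fitted GridSearchCV from every outer fold,
# so you can inspect which hyperparameters each fold actually selected.
result = cross_validate(
    grid, X, y, cv=outer_cv, scoring="roc_auc",
    return_estimator=True, n_jobs=-1,
)
for fold, fitted_search in enumerate(result["estimator"]):
    print(f"fold {fold}: best params = {fitted_search.best_params_}")
print("Nested CV ROC-AUC:", result["test_score"].mean(), "+/-", result["test_score"].std())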
3. Benefits #
- Prevents leakage between tuning and evaluation: avoids the optimistic bias that occurs when the same data are used for both.
- Fair model comparison: every candidate model is tuned independently, yet evaluated under identical outer folds.
- Uncertainty estimates: outer-loop scores provide a distribution from which you can compute variance and confidence intervals.
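As an illustration, the outer-fold scores from section 2 can be turned into a rough t-based confidence interval (with only five folds this is a coarse approximation):
from scipy import stats

# scores: the outer-fold ROC-AUC values returned by cross_val_score above.
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
low, high = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)
print(f"ROC-AUC {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")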
4. Caveats #
- Computationally expensive — training cost scales with K_outer × K_inner times the number of candidate settings, so heavy models can become impractical (see the quick count after this list).
- Beware huge grids — large search spaces multiply the runtime; consider random search or Bayesian optimisation.
- Small datasets — if data are extremely scarce, nested CV may leave too little for training; adjust K or use repeated CV carefully.
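For the setup in section 2 (5 outer folds, 3 inner folds, 6 candidate settings, refit enabled), the count works out as follows:
# outer folds × (inner folds × candidate settings + 1 refit of the best candidate)
k_outer, k_inner, n_candidates = 5, 3, 6
total_fits = k_outer * (k_inner * n_candidates + 1)
print(total_fits)  # 95 random-forest fits for the grid in section 2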
5. Practical tips #
- Great for small datasets: leakage is more harmful when data are scarce, so nested CV pays off.
- Use for top contenders: run nested CV only for shortlisted models to keep compute manageable.
- Report your settings: document the fold counts, search strategy, and random seeds to demonstrate risk control.
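One lightweight way to record those settings is a small dictionary logged next to the scores; the keys below are purely illustrative, not a standard format:
# Illustrative only: one possible way to record the nested CV configuration.
nested_cv_report = {
    "outer_cv": "KFold(n_splits=5, shuffle=True, random_state=1)",
    "inner_cv": "KFold(n_splits=3, shuffle=True, random_state=0)",
    "search": "GridSearchCV, 6 candidate settings, scoring='roc_auc'",
    "outer_scores": [round(s, 4) for s in scores],  # from the run in section 2
}
print(nested_cv_report)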
Summary #
- Nested cross-validation delivers an unbiased estimate of generalisation while performing hyperparameter optimisation.
- Combine GridSearchCV (or similar) with cross_val_score; plan for the additional compute.
- Use it for the final evaluation of critical models so the reported metrics reflect real-world performance.