Key takeaways
- Nested cross-validation separates hyperparameter search from outer validation to avoid optimistic bias.
- Build outer/inner loops in scikit-learn and inspect how leakage is prevented in code.
- Understand when the extra compute is worth it and how to communicate the results.
1. How it works #
- Outer loop — split the dataset into K_outer folds; each fold acts as an untouched test set.
- Inner loop — run K_inner-fold cross-validation on the remaining data to tune hyperparameters (grid search, random search, Bayesian optimisation, …).
- Evaluation — retrain using the best hyperparameters and score on the held-out outer fold.
Repeat this procedure K_outer times and aggregate the outer scores (mean ± standard deviation or confidence intervals). The sketch below spells out the same two loops by hand.
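In this sketch the synthetic dataset, the tiny candidate grid, and the fold counts are placeholders chosen only to keep the example small and runnable:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, ParameterGrid, cross_val_score

# Placeholder data; any binary-classification X, y works here.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# A deliberately tiny candidate grid to keep the example fast.
candidates = list(ParameterGrid({"max_depth": [None, 5], "n_estimators": [100]}))

outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, y_train = X[train_idx], y[train_idx]

    # Inner loop: rank candidates using only the outer-training data.
    best = max(
        candidates,
        key=lambda params: cross_val_score(
            RandomForestClassifier(random_state=0, **params),
            X_train, y_train, cv=inner_cv, scoring="roc_auc",
        ).mean(),
    )

    # Retrain with the winning hyperparameters and score on the untouched outer fold.
    model = RandomForestClassifier(random_state=0, **best).fit(X_train, y_train)
    pred = model.predict_proba(X[test_idx])[:, 1]
    outer_scores.append(roc_auc_score(y[test_idx], pred))

print("Nested CV ROC-AUC:", np.mean(outer_scores), "+/-", np.std(outer_scores))
In practice you rarely write these loops yourself; the scikit-learn shortcut in the next section does the same thing in far fewer lines.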
2. Implementation in Python #
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Placeholder data so the snippet runs end to end; substitute your own X, y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Hyperparameter grid explored by the inner loop.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 10, 20],
}

# Inner CV tunes hyperparameters; outer CV measures generalisation.
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=inner_cv,
    scoring="roc_auc",
    n_jobs=-1,
)

# cross_val_score refits the entire search on each outer training split,
# so the outer test fold never influences hyperparameter selection.
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc", n_jobs=-1)
print("Nested CV ROC-AUC:", scores.mean(), "+/-", scores.std())
Passing the search object (GridSearchCV, RandomizedSearchCV, etc.) into cross_val_score is all it takes to run nested cross-validation.
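If you also want to know which hyperparameters each outer fold chose, swapping cross_val_score for cross_validate with return_estimator=True exposes the fitted search objects; a small sketch reusing grid, outer_cv, X, and y from above:
from sklearn.model_selection import cross_validate

# return_estimator=True keeps the fitted GridSearchCV from every outer fold,
# so you can inspect which hyperparameters each fold actually selected.
result = cross_validate(
    grid, X, y, cv=outer_cv, scoring="roc_auc",
    return_estimator=True, n_jobs=-1,
)
for fold, fitted_search in enumerate(result["estimator"]):
    print(f"fold {fold}: best params = {fitted_search.best_params_}")
print("Nested CV ROC-AUC:", result["test_score"].mean(), "+/-", result["test_score"].std())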
3. Benefits #
- Prevents leakage between tuning and evaluation: avoids the optimistic bias that occurs when the same data are used for both.
- Fair model comparison: every candidate model is tuned independently, yet evaluated under identical outer folds.
- Uncertainty estimates: outer-loop scores provide a distribution from which you can compute variance and confidence intervals.
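As an illustration, the outer-fold scores from section 2 can be turned into a rough t-based confidence interval (with only five folds this is a coarse approximation):
from scipy import stats

# scores: the outer-fold ROC-AUC values returned by cross_val_score above.
mean = scores.mean()
sem = stats.sem(scores)  # standard error of the mean
low, high = stats.t.interval(0.95, len(scores) - 1, loc=mean, scale=sem)
print(f"ROC-AUC {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")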
4. Caveats #
- Computationally expensive — training cost scales with K_outer × K_inner times the number of candidate settings, so heavy models can become impractical (see the quick count after this list).
- Beware huge grids — large search spaces multiply the runtime; consider random search or Bayesian optimisation.
- Small datasets — if data are extremely scarce, nested CV may leave too little for training; adjust K or use repeated CV carefully.
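For the setup in section 2 (5 outer folds, 3 inner folds, 6 candidate settings, refit enabled), the count works out as follows:
# outer folds × (inner folds × candidate settings + 1 refit of the best candidate)
k_outer, k_inner, n_candidates = 5, 3, 6
total_fits = k_outer * (k_inner * n_candidates + 1)
print(total_fits)  # 95 random-forest fits for the grid in section 2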
5. Practical tips #
- Great for small datasets: leakage is more harmful when data are scarce, so nested CV pays off.
- Use for top contenders: run nested CV only for shortlisted models to keep compute manageable.
- Report your settings: document the fold counts, search strategy, and random seeds to demonstrate risk control.
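One lightweight way to record those settings is a small dictionary logged next to the scores; the keys below are purely illustrative, not a standard format:
# Illustrative only: one possible way to record the nested CV configuration.
nested_cv_report = {
    "outer_cv": "KFold(n_splits=5, shuffle=True, random_state=1)",
    "inner_cv": "KFold(n_splits=3, shuffle=True, random_state=0)",
    "search": "GridSearchCV, 6 candidate settings, scoring='roc_auc'",
    "outer_scores": [round(s, 4) for s in scores],  # from the run in section 2
}
print(nested_cv_report)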
Summary #
- Nested cross-validation delivers an unbiased estimate of generalisation while performing hyperparameter optimisation.
- Combine GridSearchCV (or similar) with cross_val_score; plan for the additional compute.
- Use it for the final evaluation of critical models so the reported metrics reflect real-world performance.