XGBoost

Fast and accurate gradient boosting

XGBoost (eXtreme Gradient Boosting) is a gradient boosting implementation that focuses on regularisation and speed. It offers built-in missing-value handling, optimised tree construction, and parallel training, making it a staple in competitions and production.


1. Key characteristics #

  • Regularised loss: L1/L2 penalties on the leaf weights reduce overfitting (see the objective sketched after this list).
  • Default direction for missing values: missing values are routed automatically.
  • Parallelisation: tree construction is parallelised by blocks for fast training.
  • Advanced parameters: fine-grained control over depth, leaves, and sampling.
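
As a rough sketch of what "regularised loss" means here (following the notation of the original XGBoost paper), each boosting round minimises the training loss plus a penalty on tree complexity:

\mathcal{L} = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2 + \alpha \lVert w \rVert_1

where T is the number of leaves and w the leaf weights; gamma, lambda, and alpha correspond to the gamma, lambda (L2), and alpha (L1) parameters.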

2. Training with the xgboost package #

import xgboost as xgb
from sklearn.metrics import mean_absolute_error

dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    "objective": "reg:squarederror",  # squared-error regression
    "eval_metric": "rmse",
    "max_depth": 6,
    "eta": 0.05,              # learning rate
    "subsample": 0.8,         # row sampling per tree
    "colsample_bytree": 0.8,  # column sampling per tree
    "lambda": 1.0,            # L2 regularisation
}

evals = [(dtrain, "train"), (dvalid, "valid")]
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=1000,
    evals=evals,
    early_stopping_rounds=50,
)

pred = bst.predict(xgb.DMatrix(X_test), iteration_range=(0, bst.best_iteration + 1))
print("MAE:", mean_absolute_error(y_test, pred))

Setting early_stopping_rounds stops training once the validation metric has not improved for that many rounds; the best round is stored in bst.best_iteration, which is why the prediction above passes iteration_range.
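
The same training loop can also be written with the scikit-learn wrapper. A minimal sketch, assuming XGBoost ≥ 1.6 (where early_stopping_rounds is a constructor argument) and the same data splits as above:

import xgboost as xgb

model = xgb.XGBRegressor(
    n_estimators=1000,        # upper bound on boosting rounds
    learning_rate=0.05,       # same role as eta above
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    eval_metric="rmse",
    early_stopping_rounds=50,
)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], verbose=False)

# With early stopping enabled, recent versions predict with the best iteration by default.
pred = model.predict(X_test)
print("Best iteration:", model.best_iteration)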


3. Main hyperparameters #

| Parameter | Role | Tuning tip |
| --- | --- | --- |
| eta | Learning rate | Smaller values are more stable but need more rounds |
| max_depth | Tree depth | Deeper trees are expressive but can overfit |
| min_child_weight | Minimum sum of instance weights in a child | Increase for noisier data |
| subsample / colsample_bytree | Row / column sampling ratios | 0.6–0.9 often improves generalisation |
| lambda, alpha | L2 / L1 regularisation | Larger values reduce overfitting; use alpha for sparsity |
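
These parameters interact, so a small randomised search over them is often more practical than tuning one at a time. A minimal sketch using the scikit-learn wrapper and RandomizedSearchCV; the ranges below are illustrative assumptions, not recommendations:

import xgboost as xgb
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space (sklearn-wrapper names: eta -> learning_rate,
# lambda -> reg_lambda, alpha -> reg_alpha).
param_distributions = {
    "learning_rate": uniform(0.01, 0.19),   # 0.01-0.20
    "max_depth": randint(3, 10),
    "min_child_weight": randint(1, 10),
    "subsample": uniform(0.6, 0.3),         # 0.6-0.9
    "colsample_bytree": uniform(0.6, 0.3),
    "reg_lambda": uniform(0.0, 5.0),
    "reg_alpha": uniform(0.0, 1.0),
}

search = RandomizedSearchCV(
    xgb.XGBRegressor(n_estimators=500, tree_method="hist"),
    param_distributions,
    n_iter=30,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)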

4. Practical usage #

  • Structured data: strong performance on encoded tabular data.
  • Missing values: missing values are handled internally.
  • Feature importance: gain / weight / cover importance types are available, e.g. via get_score or plot_importance.
  • Interpretability: shap.TreeExplainer computes SHAP values for a trained booster, and xgboost.to_graphviz renders individual trees (see the sketch after this list).
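
A minimal sketch of both, reusing bst and X_valid from section 2 and assuming the shap package is installed:

# Gain-based importance from the trained booster
importance = bst.get_score(importance_type="gain")
print(sorted(importance.items(), key=lambda kv: kv[1], reverse=True)[:10])

# SHAP values for the validation set
import shap

explainer = shap.TreeExplainer(bst)
shap_values = explainer.shap_values(X_valid)
shap.summary_plot(shap_values, X_valid)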

5. Extra tips #

  • Lower the learning rate (e.g., 0.1 → 0.02) while increasing rounds to boost accuracy.
  • tree_method: "hist" is the fast general-purpose choice; for GPU training use device="cuda" with XGBoost ≥ 2.0 (older releases used "gpu_hist"), and "approx" remains an option for very large data.
  • Cross-validation: use xgb.cv with early_stopping_rounds to estimate the optimal number of rounds (see the sketch after this list).
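
A minimal xgb.cv sketch, reusing params and dtrain from section 2 (the result columns follow the rmse eval_metric set there):

cv_results = xgb.cv(
    params,
    dtrain,
    num_boost_round=2000,
    nfold=5,
    early_stopping_rounds=50,
    seed=0,
)
print("Best rounds:", len(cv_results))                 # rows remaining after early stopping
print("CV RMSE:", cv_results["test-rmse-mean"].min())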

Summary #

  • XGBoost combines regularisation, missing handling, and speed for strong results on tabular data.
  • Tune eta, max_depth, min_child_weight, sampling, and regularisation together.
  • Choose between XGBoost, LightGBM, and CatBoost based on data characteristics.