LightGBM

Intermediate

2.4.7


Last updated 2020-04-22 Read time 5 min
Summary
  • LightGBM is a fast gradient-boosting library developed by Microsoft that uses histogram approximation and leaf-wise tree growth for high accuracy at reduced training time.
  • It supports native categorical features, GPU training, and distributed learning.
  • For regression tasks with outliers (e.g., real estate pricing), Huber loss (objective="huber") provides robustness by switching between L2 and L1 loss based on residual size.
  • Key hyperparameters: num_leaves, learning_rate, min_data_in_leaf, lambda_l1/l2, and alpha (Huber threshold).

How LightGBM works #

LightGBM is a gradient-boosting framework developed by Microsoft. Two core innovations differentiate it from earlier gradient-boosting implementations:

1. Histogram-based splitting #

Instead of sorting all values to find the best split (O(n log n)), LightGBM bins continuous features into discrete histograms (255 bins by default, controlled by max_bin). Split search becomes O(#bins) per feature instead of O(n), dramatically reducing memory and computation.
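The effect of binning can be sketched with NumPy. This is illustrative only: it counts candidate thresholds, while LightGBM's real implementation also accumulates gradient statistics per bin.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)  # one continuous feature, n = 100,000

# Exact split search considers up to n - 1 candidate thresholds
# (every gap between consecutive sorted values).
exact_candidates = len(np.unique(x)) - 1

# Histogram approximation: quantile-based bin edges, at most 255 bins.
edges = np.quantile(x, np.linspace(0, 1, 256))
binned = np.digitize(x, edges[1:-1])  # map each value to a bin index 0..254
hist_candidates = len(edges) - 1      # only 255 thresholds to scan

print(exact_candidates, hist_candidates)  # ~100,000 vs 255
```

The histogram also needs far less memory: each value is stored as a small bin index rather than a full float, which is where much of LightGBM's memory saving comes from.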

2. Leaf-wise (best-first) tree growth #

Most frameworks grow trees level-by-level (depth-wise). LightGBM always splits the leaf with the largest loss reduction. This often achieves lower loss with fewer nodes, but requires max_depth or num_leaves constraints to prevent overfitting on small datasets.


Regression with Huber loss #

Standard mean-squared error (MSE / L2 loss) is sensitive to outliers — a single extreme value can dominate the gradient and distort the model. Huber loss is a robust alternative that behaves like L2 loss for small residuals and L1 loss for large ones:

$$ L_\delta(r) = \begin{cases} \frac{1}{2} r^2 & |r| \le \delta \\ \delta \left(|r| - \frac{\delta}{2}\right) & |r| > \delta \end{cases} $$

The threshold \(\delta\) (parameter alpha in LightGBM) controls the transition point. Large residuals (outliers) receive linear rather than quadratic penalty, limiting their influence.
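A minimal NumPy sketch of the loss and its gradient makes the two regimes concrete (illustrative only; LightGBM computes these internally when objective="huber"):

```python
import numpy as np

def huber_loss(r, delta=0.9):
    """Piecewise loss: quadratic for |r| <= delta, linear beyond."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

def huber_grad(r, delta=0.9):
    """Gradient w.r.t. the residual: r inside the threshold, ±delta outside."""
    return np.clip(r, -delta, delta)

r = np.array([-10.0, -0.5, 0.1, 0.5, 10.0])
print(huber_loss(r))  # outliers get a linear, not quadratic, penalty
print(huber_grad(r))  # gradient capped at ±0.9
```

Note that a residual of 10 contributes a loss of about 8.6 here, versus 50 under plain MSE, which is exactly the dampening effect on outliers.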

When to use Huber loss #

  • Real estate pricing: property prices have extreme outliers (luxury properties, distressed sales) that inflate L2 gradients.
  • Demand forecasting: promotional spikes or data entry errors create outlier demand values.
  • Any regression with heavy-tailed distributions: income, insurance claims, sales volumes.

Python example: LightGBM with Huber loss #

import lightgbm as lgb
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_absolute_error, root_mean_squared_error
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

params_l2 = {
    "objective": "regression",      # L2 / MSE loss
    "num_leaves": 63,
    "learning_rate": 0.05,
    "n_estimators": 300,
    "verbose": -1,
}

params_huber = {
    "objective": "huber",           # Huber loss
    "alpha": 0.9,                   # Huber threshold (delta)
    "num_leaves": 63,
    "learning_rate": 0.05,
    "n_estimators": 300,
    "verbose": -1,
}

model_l2 = lgb.LGBMRegressor(**params_l2).fit(X_train, y_train)
model_huber = lgb.LGBMRegressor(**params_huber).fit(X_train, y_train)

for name, model in [("L2", model_l2), ("Huber", model_huber)]:
    pred = model.predict(X_test)
    print(f"{name}  MAE={mean_absolute_error(y_test, pred):.4f}  "
          f"RMSE={root_mean_squared_error(y_test, pred):.4f}")

Key hyperparameters #

Parameter              What it controls                            Typical range
num_leaves             Max leaves per tree (complexity)            15–255
learning_rate          Shrinkage per round                         0.01–0.1
min_data_in_leaf       Minimum samples per leaf (regularisation)   20–500
lambda_l1 / lambda_l2  L1/L2 regularisation on leaf weights        0–10
feature_fraction       Fraction of features per tree               0.6–1.0
alpha                  Huber threshold δ (objective="huber" only)  0.1–0.99
n_estimators           Number of boosting rounds                   100–3000

Use early stopping (callbacks=[lgb.early_stopping(50)]) and a validation set to pick n_estimators automatically.


FAQ #

What is LightGBM and why is it fast? #

LightGBM is a gradient-boosting framework that uses two speed optimisations: (1) histogram-based feature binning, which reduces split search from O(n) per feature to O(#bins), and (2) leaf-wise tree growth, which expands the most promising leaf at each step rather than expanding all leaves at the same depth. Together these make LightGBM significantly faster than older implementations like XGBoost’s exact algorithm, especially on large datasets with many features.

How does Huber loss make LightGBM robust to outliers? #

Huber loss combines the strengths of L2 (smooth gradients, fast convergence near the optimum) and L1 (bounded gradient for large errors). For residuals smaller than the threshold \(\delta\), the gradient is proportional to the residual (L2 behaviour). For large residuals — outliers — the gradient is capped at \(\pm\delta\), preventing a few extreme values from dominating model updates. This makes the model less sensitive to outlier properties in real estate, anomalous demand values, or data entry errors.

How do I choose the alpha (delta) parameter for Huber loss? #

alpha in LightGBM corresponds to \(\delta\) in the Huber formula. It defines the boundary between quadratic and linear penalty:

  • Large alpha (close to 1): most residuals treated as L2 — less robust but lower bias when outliers are rare.
  • Small alpha (close to 0): almost all residuals treated as L1 — very robust but slower convergence.

A practical approach: start with alpha=0.9, then reduce if large errors dominate your validation metrics (e.g., RMSE much larger than MAE). Cross-validate over a grid of values and select based on the metric that matches your business objective.
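The RMSE-versus-MAE gap mentioned above is easy to check directly on validation residuals. This is a rough diagnostic sketch, not part of LightGBM; the helper name and thresholds are made up for illustration.

```python
import numpy as np

def rmse_mae_ratio(residuals):
    """RMSE/MAE well above ~1.25 suggests heavy tails -> try a smaller alpha."""
    r = np.asarray(residuals, dtype=float)
    rmse = np.sqrt(np.mean(r**2))
    mae = np.mean(np.abs(r))
    return rmse / mae

rng = np.random.default_rng(0)
clean = rng.normal(size=10_000)                             # well-behaved residuals
heavy = np.concatenate([clean, rng.normal(size=100) * 50])  # 1% extreme errors

print(rmse_mae_ratio(clean))  # ~1.25 for Gaussian residuals
print(rmse_mae_ratio(heavy))  # much larger: lowering alpha is worth trying
```

For purely Gaussian residuals the ratio sits near 1.25, so values far above that are a quick signal that a handful of large errors is dominating your squared-error metric.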

When should I choose LightGBM over XGBoost or CatBoost? #

                      LightGBM                          XGBoost            CatBoost
Speed (large data)    Fastest                           Moderate           Moderate
Categorical features  Native (set categorical_feature)  Requires encoding  Best native support
Leaf-wise growth      Yes                               No (level-wise)    No (symmetric trees)
GPU training          Yes                               Yes                Yes
Overfitting risk      Higher with small data            Lower              Lower

Use LightGBM for large tabular datasets where speed matters. Use CatBoost when you have many high-cardinality categorical features. XGBoost is a reliable all-rounder with mature tooling.

Can LightGBM handle missing values natively? #

Yes. LightGBM can handle NaN values without imputation. During training it learns the optimal direction to send missing values at each split (either left or right child), treating missingness as its own signal. This is useful for real-world datasets where missingness is informative (e.g., a missing price history may indicate a new listing).