CatBoost

CatBoost | Gradient boosting that shines with categorical features

CatBoost is a gradient-boosted tree model developed by Yandex that handles categorical features extremely well. With ordered target statistics and symmetric (oblivious) trees, it delivers strong accuracy with minimal preprocessing.

It reaches strong accuracy even with small learning rates, handles missing values out of the box, and its ordered boosting counteracts the prediction shift that plain gradient boosting suffers from, which makes it popular in both production and competitions.


1. How CatBoost works #

  • Categorical encoding
    Uses ordered target statistics: rows are processed in a random permutation and each category's smoothed running mean is computed only from the rows seen before it, which avoids target leakage and the drawbacks of one-hot or plain label encoding (a toy sketch follows this list).

  • Oblivious trees
    Every node on a given level splits on the same feature and threshold, so a tree of depth d has exactly 2^d leaves. This structure is GPU-friendly and fast at inference time.

  • Ordered boosting
    Gradients are estimated with different permutations to reduce overfitting and improve stability compared with standard boosting.

  • Rich feature set
    Supports class weighting, text features, monotonic constraints, and custom evaluation metrics.
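
Below is a toy sketch of the ordered-target-statistics idea (illustrative only, not CatBoost's internal code): each row is encoded with a smoothed category mean computed solely from the rows that precede it in a random permutation, so a row's own target never leaks into its encoding.

import numpy as np

def ordered_target_statistics(cat_values, targets, prior, a=1.0, seed=0):
    """Encode each row using only the rows that appear before it in a random permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(cat_values))
    sums, counts = {}, {}          # running sum and count of targets per category
    encoded = np.empty(len(cat_values))
    for pos in perm:
        c = cat_values[pos]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[pos] = (s + a * prior) / (n + a)   # smoothed mean over the "history" only
        sums[c] = s + targets[pos]
        counts[c] = n + 1
    return encoded

cats = np.array(["red", "blue", "red", "red", "blue", "green"])
y = np.array([1, 0, 1, 0, 1, 0])
print(ordered_target_statistics(cats, y, prior=y.mean()))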


2. Train a classifier in Python #

from catboost import CatBoostClassifier, Pool
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Credit dataset (mixed categorical and numeric)
data = fetch_openml(name="credit-g", version=1, as_frame=True)
X = data.data
y = (data.target == "good").astype(int)

categorical_features = X.select_dtypes(include="category").columns.tolist()

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_pool = Pool(
    X_train,
    label=y_train,
    cat_features=categorical_features,
)
valid_pool = Pool(X_valid, label=y_valid, cat_features=categorical_features)

model = CatBoostClassifier(
    depth=6,
    iterations=1000,
    learning_rate=0.03,
    loss_function="Logloss",
    eval_metric="AUC",
    random_seed=42,
    early_stopping_rounds=50,
    verbose=100,
)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

proba = model.predict_proba(X_valid)[:, 1]
pred = (proba >= 0.5).astype(int)
print("ROC-AUC:", roc_auc_score(y_valid, proba))
print(classification_report(y_valid, pred, digits=3))

Passing the categorical column names to Pool via cat_features is enough; ordered target encoding is applied internally.


3. Key hyperparameters #

Parameter           | Role / tuning tips
depth               | Tree depth. With oblivious trees, depth=6 yields 64 leaves. Deeper trees increase capacity but risk overfitting.
iterations          | Number of boosting rounds. Use with early_stopping_rounds.
learning_rate       | Smaller values often improve accuracy but need more iterations.
l2_leaf_reg         | L2 regularisation on leaves; higher values smooth the model.
border_count        | Number of bins for numeric features (default 254). Fewer bins are faster but less precise.
bagging_temperature | Controls row sampling randomness; near 0 is deterministic.
class_weights       | Directly set class weights for imbalanced data.
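
As a sketch of how these knobs are typically tuned together: CatBoostClassifier follows the scikit-learn estimator API, so RandomizedSearchCV can drive the search (the ranges below are illustrative, scipy is assumed to be available, and X_train, y_train, and categorical_features are reused from section 2).

from catboost import CatBoostClassifier
from scipy.stats import loguniform, randint
from sklearn.model_selection import RandomizedSearchCV

# cat_features is set on the estimator so every CV fold encodes categories consistently
base = CatBoostClassifier(
    cat_features=categorical_features,
    loss_function="Logloss",
    eval_metric="AUC",
    random_seed=42,
    verbose=0,
)
param_distributions = {
    "depth": randint(4, 9),                 # 4-8; each extra level doubles the leaf count
    "learning_rate": loguniform(0.01, 0.3),
    "l2_leaf_reg": loguniform(1.0, 10.0),
    "iterations": randint(300, 1500),
    "bagging_temperature": loguniform(0.01, 1.0),
}
search = RandomizedSearchCV(
    base,
    param_distributions,
    n_iter=20,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X_train, y_train)
print(search.best_params_)
print("CV ROC-AUC:", search.best_score_)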

4. Feature importance and SHAP #

importance = model.get_feature_importance(type="PredictionValuesChange")
for name, score in sorted(zip(X.columns, importance), key=lambda x: -x[1])[:10]:
    print(f"{name}: {score:.3f}")

shap_values = model.get_feature_importance(valid_pool, type="ShapValues")
# shap_values[:, -1] is the baseline; matplotlib can plot summaries.

  • PredictionValuesChange reports how much, on average, the prediction changes when a feature's value changes; the values are non-negative and normalised, so they rank features but do not indicate a direction.
  • SHAP values are available via type="ShapValues" and are useful for per-sample explanation.
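
For a visual summary of those SHAP values, the third-party shap package can plot them directly (a sketch, assuming shap is installed; the baseline column is dropped before plotting):

import shap  # third-party package, installed separately from catboost

shap_matrix = model.get_feature_importance(valid_pool, type="ShapValues")
# drop the last column (the expected value / baseline) and plot per-feature impact
shap.summary_plot(shap_matrix[:, :-1], X_valid, feature_names=list(X.columns))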

5. CatBoost-specific techniques #

  • Feature combinations: one_hot_max_size sets when low-cardinality categories are one-hot encoded instead of target-encoded, and max_ctr_complexity controls how many categorical features are combined into interaction terms.
  • Text features: set text_features and configure text_processing (tokenizers, dictionaries, and feature calcers such as bag-of-words or BM25).
  • Monotonic constraints: use monotone_constraints to enforce monotonicity for pricing or risk models.
  • Built-in CV: the cv function runs cross-validation directly on a Pool, so categorical encoding and ordered boosting are applied consistently in every fold (see the sketch after this list).
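
A minimal cross-validation sketch, reusing train_pool from section 2 (the fold count and parameters are illustrative):

from catboost import cv

cv_results = cv(
    pool=train_pool,
    params={
        "loss_function": "Logloss",
        "eval_metric": "AUC",
        "depth": 6,
        "learning_rate": 0.03,
        "iterations": 500,
        "random_seed": 42,
    },
    fold_count=5,
    shuffle=True,
    verbose=False,
)
# the result is a pandas DataFrame of per-iteration mean/std scores (e.g. "test-AUC-mean")
print(cv_results.tail(3))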

6. When to choose CatBoost #

  • Mostly categorical data and minimal preprocessing → CatBoost is a strong first choice.
  • Mostly numeric data and a need to iterate quickly → LightGBM's histogram-based training is often faster.
  • Large sparse data → XGBoost’s DMatrix can be memory efficient.
  • In ensembles, adding CatBoost can improve robustness on categorical features.

7. Summary #

  • CatBoost handles categorical features automatically and stabilises training with ordered boosting.
  • Balancing depth, iterations, and learning_rate, plus l2_leaf_reg, is key for performance.
  • Its SHAP and text support make it a versatile member of an ensemble stack.