CatBoost is a gradient-boosted tree model developed by Yandex that handles categorical features extremely well. With ordered target statistics and symmetric (oblivious) trees, it delivers strong accuracy with minimal preprocessing.
It trains to strong accuracy even with small learning rates, handles missing values natively, and is relatively robust to distribution shift, which makes it popular in both production systems and competitions.
1. How CatBoost works #
- **Categorical encoding**: Uses ordered target statistics, shuffling the data and updating category means on the fly, which avoids target leakage and the drawbacks of one-hot or label encoding (a toy sketch follows this list).
- **Oblivious trees**: Every node at the same level splits on the same feature and threshold, so a tree of depth d has exactly 2^d leaves. This structure is GPU-friendly and fast at inference time.
- **Ordered boosting**: Gradients are estimated on multiple random permutations of the data to reduce overfitting and improve stability compared with standard boosting.
- **Rich feature set**: Supports class weighting, text features, monotonic constraints, and custom metrics.
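To make the ordered-statistics idea concrete, here is a minimal toy sketch (my own simplification, not CatBoost's actual implementation): each row is encoded using only the rows that come *before* it in a random permutation, smoothed by a prior, so a row's own label never leaks into its encoding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({"city": ["a", "b", "a", "a", "b", "c"],
                   "y":    [1,   0,   1,   0,   1,   0]})

prior = df["y"].mean()        # global mean used as a smoothing prior
prior_weight = 1.0
perm = rng.permutation(len(df))

encoded = np.zeros(len(df))
sums, counts = {}, {}
for idx in perm:
    cat = df.loc[idx, "city"]
    s, c = sums.get(cat, 0.0), counts.get(cat, 0)
    # Encode from rows seen earlier in the permutation plus the prior only.
    encoded[idx] = (s + prior * prior_weight) / (c + prior_weight)
    # Update the running statistics *after* encoding the current row.
    sums[cat] = s + df.loc[idx, "y"]
    counts[cat] = c + 1

df["city_ordered_ts"] = encoded
print(df)
```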
2. Train a classifier in Python #
```python
from catboost import CatBoostClassifier, Pool
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

# Credit dataset (mixed categorical and numeric)
data = fetch_openml(name="credit-g", version=1, as_frame=True)
X = data.data
y = (data.target == "good").astype(int)
categorical_features = X.select_dtypes(include="category").columns.tolist()

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_pool = Pool(
    X_train,
    label=y_train,
    cat_features=categorical_features,
)
valid_pool = Pool(X_valid, label=y_valid, cat_features=categorical_features)

model = CatBoostClassifier(
    depth=6,
    iterations=1000,
    learning_rate=0.03,
    loss_function="Logloss",
    eval_metric="AUC",
    random_seed=42,
    early_stopping_rounds=50,
    verbose=100,
)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

proba = model.predict_proba(X_valid)[:, 1]
pred = (proba >= 0.5).astype(int)
print("ROC-AUC:", roc_auc_score(y_valid, proba))
print(classification_report(y_valid, pred, digits=3))
```
Passing the categorical columns to `Pool` via `cat_features` is enough; ordered target encoding and permutation handling are applied internally.
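Once training finishes, the fitted model can be persisted in CatBoost's native format and reloaded for inference. A minimal sketch continuing from the block above (the file name is arbitrary):

```python
# Persist the trained model and reload it for inference.
model.save_model("catboost_credit.cbm")

from catboost import CatBoostClassifier
restored = CatBoostClassifier()
restored.load_model("catboost_credit.cbm")
print(restored.predict_proba(X_valid)[:5, 1])
```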
3. Key hyperparameters #
| Parameter | Role / tuning tips |
|---|---|
| `depth` | Tree depth. With oblivious trees, `depth=6` yields 64 leaves. Deeper trees increase capacity but risk overfitting. |
| `iterations` | Number of boosting rounds. Use together with `early_stopping_rounds`. |
| `learning_rate` | Smaller values often improve accuracy but need more iterations. |
| `l2_leaf_reg` | L2 regularisation on leaf values; higher values smooth the model. |
| `border_count` | Number of bins for numeric features (default 254). Fewer bins are faster but less precise. |
| `bagging_temperature` | Controls row sampling randomness; values near 0 are nearly deterministic. |
| `class_weights` | Directly sets class weights for imbalanced data. |
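As a rough illustration of tuning these knobs together, CatBoost's estimators expose a built-in `grid_search` helper. The sketch below reuses `train_pool` from section 2; the grid values are illustrative, not recommendations:

```python
from catboost import CatBoostClassifier

# Illustrative grid only; real searches usually cover wider ranges.
param_grid = {
    "depth": [4, 6, 8],
    "learning_rate": [0.01, 0.03, 0.1],
    "l2_leaf_reg": [1, 3, 9],
}

search_model = CatBoostClassifier(
    iterations=500,
    loss_function="Logloss",
    eval_metric="AUC",
    random_seed=42,
    verbose=False,
)
# grid_search refits on the best parameters by default and returns the results.
result = search_model.grid_search(param_grid, train_pool, cv=3, verbose=False)
print(result["params"])
```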
4. Feature importance and SHAP #
```python
importance = model.get_feature_importance(type="PredictionValuesChange")
for name, score in sorted(zip(X.columns, importance), key=lambda x: -x[1])[:10]:
    print(f"{name}: {score:.3f}")

shap_values = model.get_feature_importance(valid_pool, type="ShapValues")
# shap_values[:, -1] is the baseline (expected value); matplotlib can plot summaries.
```
- `PredictionValuesChange` measures how much, on average, the prediction changes when a feature's value changes.
- SHAP values are available via `type="ShapValues"` and are useful for per-sample explanations.
5. CatBoost-specific techniques #
- **Feature combinations**: `one_hot_max_size` and `combinations_ctr` control how categorical features are encoded and combined into interactions.
- **Text features**: Set `text_features` and configure `text_processing` (TF-IDF, BM25).
- **Monotonic constraints**: Use `monotone_constraints` to enforce monotonicity for pricing or risk models.
- **Built-in CV**: `cv` runs cross-validation that respects ordered boosting; `monotone_constraints` and `cv` are sketched after this list.
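A brief, hedged sketch of the last two items, reusing `train_pool` from section 2. The constrained columns (`duration`, `credit_amount`) and the fold count are illustrative choices, not recommendations:

```python
from catboost import CatBoostClassifier, cv

# Force the prediction to be non-decreasing in `duration` and non-increasing in
# `credit_amount` (illustrative choices for two numeric credit-g columns).
constrained = CatBoostClassifier(
    iterations=300,
    monotone_constraints={"duration": 1, "credit_amount": -1},
    verbose=False,
)
constrained.fit(train_pool)

# Built-in cross-validation over the same Pool; results come back as a DataFrame
# with per-iteration mean/std columns for each metric.
cv_results = cv(
    pool=train_pool,
    params={"loss_function": "Logloss", "eval_metric": "AUC", "iterations": 300},
    fold_count=5,
    verbose=False,
)
print(cv_results.tail())
```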
6. When to choose CatBoost #
- Mostly categorical data and minimal preprocessing → CatBoost is a strong first choice.
- Mostly numeric data where you need to iterate quickly → LightGBM's histogram-based training is often faster.
- Large sparse data → XGBoost's `DMatrix` can be more memory-efficient.
- In ensembles, adding CatBoost can improve robustness on categorical features (a minimal blending sketch follows this list).
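To make the ensemble point concrete, here is a minimal blending sketch; `other_proba` is a hypothetical placeholder for a second model's validation probabilities (for example from LightGBM), and the 50/50 weighting is arbitrary:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def blend(p_catboost, p_other, w=0.5):
    # Weighted average of two models' positive-class probabilities.
    return w * np.asarray(p_catboost) + (1 - w) * np.asarray(p_other)

# `proba` comes from section 2; `other_proba` is a placeholder for another
# model's probabilities on the same validation rows.
# blended = blend(proba, other_proba)
# print("Blended ROC-AUC:", roc_auc_score(y_valid, blended))
```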
7. Summary #
- CatBoost handles categorical features automatically and stabilises training with ordered boosting.
- Balancing `depth`, `iterations`, and `learning_rate`, plus `l2_leaf_reg`, is key for performance.
- Its SHAP and text-feature support make it a versatile member of an ensemble stack.