CatBoost | Gradient Boosting That Loves Categorical Columns

Created: 2019-06-05 Last updated: 2020-05-06 Read time: 2 min

CatBoost, from Yandex, is a gradient boosting library designed specifically around categorical columns. It uses Ordered Target Statistics (to reduce target leakage) and symmetric trees (oblivious trees), which make training fast and stable and keep data preparation to a minimum.

It supports missing values, class weights, and text features out of the box, and often performs well with its default settings.

Highlights

Categorical encoding: uses online (ordered) target statistics, which helps avoid target leakage, so there is no need to one-hot encode by hand
Oblivious trees: every node at a given depth uses the same feature and threshold, so inference is fast and memory use is predictable
Ordered boosting: each example's gradient is estimated by a model trained only on the examples that precede it in a random permutation, which reduces prediction shift and helps curb overfitting
Full feature set: class weights, monotonic constraints, text features, and prediction explanations
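
To make the oblivious-tree idea concrete, here is a minimal NumPy sketch of how one symmetric tree could be evaluated. It is not CatBoost's internal code: the function name `oblivious_tree_predict`, the bit order used for the leaf index, and the demo tree are all assumptions chosen for illustration.

```python
import numpy as np


def oblivious_tree_predict(X, split_features, thresholds, leaf_values):
    """Evaluate one symmetric (oblivious) tree on a batch of rows.

    Level d applies the same test X[:, split_features[d]] > thresholds[d]
    to every row, so a depth-D tree reduces to a lookup in 2**D leaf values.
    """
    X = np.asarray(X, dtype=float)
    leaf_index = np.zeros(X.shape[0], dtype=np.int64)
    for level, (feature, threshold) in enumerate(zip(split_features, thresholds)):
        # One comparison per level; its outcome becomes one bit of the leaf index.
        bit = (X[:, feature] > threshold).astype(np.int64)
        leaf_index |= bit << level
    return np.asarray(leaf_values, dtype=float)[leaf_index]


# Hypothetical depth-2 tree: level 0 tests feature 0 > 0.5, level 1 tests feature 1 > 10.
X_demo = np.array([[0.2, 5.0], [0.9, 5.0], [0.2, 20.0], [0.9, 20.0]])
leaf_values_demo = [-1.0, -0.5, 0.5, 1.0]  # indexed by the 2-bit leaf code
demo_pred = oblivious_tree_predict(X_demo, [0, 1], [0.5, 10.0], leaf_values_demo)
print(demo_pred)
```

Because every row goes through the same D comparisons, the work per row is fixed and the tree needs only D (feature, threshold) pairs plus a table of 2**D leaf values.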

Sample code (classification)

from catboost import CatBoostClassifier, Pool
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

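# German Credit data from OpenML; the target is "good" or "bad".
data = fetch_openml(name="credit-g", version=1, as_frame=True)
X = data.data
y = (data.target == "good").astype(int)

# Columns with pandas "category" dtype are handed to CatBoost as categorical features.
cat_cols = X.select_dtypes(include="category").columns.tolist()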

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_pool = Pool(X_train, label=y_train, cat_features=cat_cols)
valid_pool = Pool(X_valid, label=y_valid, cat_features=cat_cols)

model = CatBoostClassifier(
    depth=6,
    iterations=1000,
    learning_rate=0.03,
    loss_function="Logloss",
    eval_metric="AUC",
    random_seed=42,
    early_stopping_rounds=50,
    verbose=100,
)
model.fit(train_pool, eval_set=valid_pool, use_best_model=True)

# Score the validation Pool so categorical columns are handled exactly as in training.
proba = model.predict_proba(valid_pool)[:, 1]
pred = (proba >= 0.5).astype(int)
print("ROC-AUC:", roc_auc_score(y_valid, proba))
print(classification_report(y_valid, pred, digits=3))

Passing the categorical column names to Pool is all it takes to get leak-resistant target encoding.
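
For intuition about why this encoding resists leakage, the sketch below computes a simplified ordered target statistic in plain NumPy. It is an illustration, not CatBoost's implementation: it uses a single permutation (CatBoost uses several), and the function name and the `prior`, `prior_weight`, and `order` parameters are assumptions made for this example.

```python
import numpy as np


def ordered_target_statistic(categories, targets, prior=0.5, prior_weight=1.0,
                             order=None, seed=0):
    """Encode a categorical column using only the rows seen earlier in `order`."""
    categories = list(categories)
    targets = np.asarray(targets, dtype=float)
    if order is None:
        # A random permutation plays the role of an artificial "time" axis.
        order = np.random.default_rng(seed).permutation(len(categories))

    running_sum, running_count = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for idx in order:
        key = categories[idx]
        s = running_sum.get(key, 0.0)
        c = running_count.get(key, 0)
        # The row's own target is not included: only earlier rows contribute.
        encoded[idx] = (s + prior_weight * prior) / (c + prior_weight)
        # Only now is the row's target added, for the benefit of later rows.
        running_sum[key] = s + targets[idx]
        running_count[key] = c + 1
    return encoded


# Toy data processed in its natural order so the numbers are easy to follow.
cats_demo = ["a", "b", "a", "a"]
y_demo = [1, 0, 0, 1]
ts_demo = ordered_target_statistic(cats_demo, y_demo, order=range(4))
print(ts_demo)
```

The first row of each category receives only the prior, and every later row sees a smoothed mean of the earlier targets for that category. A row's encoding never depends on its own label, which is the property a plain mean target encoding lacks.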

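Key parameters

Parameter	Role
`depth`	Depth of the symmetric trees (typical values: 6–10)
`iterations`	Number of boosting rounds; pair it with `early_stopping_rounds` to guard against overfitting
`learning_rate`	Smaller values train more steadily but need more `iterations`
`l2_leaf_reg`	L2 regularization on leaf values; larger values give a smoother model
`bagging_temperature`	Strength of the randomness in Bayesian bootstrap row weights (0 gives every row equal weight)
`border_count`	Number of bins (split borders) used to quantize continuous features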
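
Of these parameters, `bagging_temperature` is probably the least intuitive. The sketch below assumes the Bayesian bootstrap rule described in the CatBoost documentation, where each row's weight is w = (-log u)^t with u drawn uniformly from (0, 1) and t the bagging temperature. The helper name is hypothetical and the code only simulates the weights; it does not call CatBoost.

```python
import numpy as np


def bayesian_bootstrap_weights(n_rows, bagging_temperature, seed=0):
    """Simulate per-row weights w = (-log u) ** t (assumed rule, see lead-in)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 1.0, size=n_rows)
    u = np.clip(u, np.finfo(float).tiny, 1.0)  # avoid log(0)
    return (-np.log(u)) ** bagging_temperature


# Higher temperature means more uneven weights, i.e. more aggressive resampling.
for t in (0.0, 1.0, 5.0):
    w = bayesian_bootstrap_weights(10_000, t)
    print(f"temperature={t}: min={w.min():.3f} mean={w.mean():.3f} std={w.std():.3f}")
```

At temperature 0 every weight is exactly 1, so there is no resampling. At 1 the weights follow an exponential distribution, and larger values concentrate the weight on fewer rows. The setting takes effect when `bootstrap_type` is `"Bayesian"`.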

Interpretation

importance = model.get_feature_importance(type="PredictionValuesChange")
for name, score in sorted(zip(X.columns, importance), key=lambda x: -x[1])[:10]:
    print(f"{name}: {score:.3f}")

# One row per example: one column per feature plus a final expected-value column.
shap_values = model.get_feature_importance(valid_pool, type="ShapValues")

Use PredictionValuesChange to see how much, on average, the prediction changes when a feature's value changes (a magnitude, not a direction), or pull SHAP values to explain individual examples.
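
The SHAP matrix is easier to read once it is split into its parts. The helper below is a small sketch that relies on the layout CatBoost returns for `type="ShapValues"` in binary classification: one column per feature followed by a final expected-value column. The function name `explain_row` and the synthetic numbers are illustrative assumptions, not model output.

```python
import numpy as np


def explain_row(shap_matrix, feature_names, row=0, top_k=3):
    """Split one SHAP row into ranked feature contributions and the base value."""
    shap_matrix = np.asarray(shap_matrix, dtype=float)
    contributions = shap_matrix[row, :-1]   # per-feature SHAP values
    expected_value = shap_matrix[row, -1]   # last column: baseline (raw score)
    # SHAP values are additive: baseline + contributions = raw model output.
    raw_prediction = expected_value + contributions.sum()
    top = np.argsort(-np.abs(contributions))[:top_k]
    ranked = [(feature_names[i], float(contributions[i])) for i in top]
    return ranked, float(expected_value), float(raw_prediction)


# Synthetic stand-in for a ShapValues matrix (2 rows, 3 features + expected value).
shap_demo = np.array([
    [0.30, -0.10, 0.05, -0.40],
    [-0.20, 0.25, 0.00, -0.40],
])
names_demo = ["duration", "credit_amount", "age"]  # illustrative column names
ranked, base, raw = explain_row(shap_demo, names_demo, row=0, top_k=2)
print(ranked, base, raw)
```

With a Logloss model the raw prediction is a log-odds score, so passing it through a sigmoid gives the predicted probability. On real output, the call would be `explain_row(shap_values, list(X.columns))`.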

When to use CatBoost

Tabular data with many or complex categorical columns
You want a model that is easy to use, has good defaults, and can run on a GPU
You want to cut down on one-hot or manual encoding
You want to combine it with other models in a stacking ensemble for added diversity

Summary

CatBoost = gradient boosting focused on categorical columns and stability
Tune the three main parameters (depth, iterations, learning_rate) first, then fine-tune l2_leaf_reg
It ships with explanation tools (feature importance, SHAP) and supports text features
A strong early choice for tabular data with mixed column types