Random Forest

Random Forest trains many decision trees on bootstrap samples, considering a random subset of features at each split. It predicts by majority vote (classification) or averaging (regression), which reduces variance and improves robustness over a single tree.

How it works (formulas) #

  • Train a tree $h_b(x)$ on each bootstrap sample $\mathcal{D}_b$, for $b=1,\dots,B$, considering only a random subset of features at each split.
  • Prediction:
    • Classification: $\hat y = \operatorname*{arg\,max}_c \sum_{b=1}^B \mathbf{1}[h_b(x)=c]$
    • Regression: $\hat y = \tfrac{1}{B}\sum_{b=1}^B h_b(x)$

Split criterion example (Gini): $\mathrm{Gini}(S)=1-\sum_c p(c\mid S)^2$
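As a minimal sketch of these formulas (independent of the scikit-learn model trained below), the snippet computes the Gini impurity of a label set and aggregates made-up per-tree outputs by majority vote and by averaging:

import numpy as np

def gini(labels):
    """Gini(S) = 1 - sum_c p(c|S)^2 for the labels reaching a node."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical per-tree outputs h_b(x) for a single input x (B = 5 trees)
tree_classes = np.array([1, 0, 1, 1, 0])            # classification outputs
tree_values = np.array([2.3, 1.9, 2.1, 2.4, 2.0])   # regression outputs

# Classification: class with the most votes; regression: simple average
votes = np.bincount(tree_classes)
print("majority vote:", votes.argmax())              # -> 1
print("average:", tree_values.mean())                # -> 2.14
print("Gini of [0,0,1,1,1,1]:", gini([0, 0, 1, 1, 1, 1]))  # 1 - (1/3)^2 - (2/3)^2 ≈ 0.444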


Train on synthetic data and check ROC-AUC #

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

n_features = 20
X, y = make_classification(
    n_samples=2500,
    n_features=n_features,
    n_informative=10,
    n_classes=2,
    n_redundant=0,
    n_clusters_per_class=4,
    random_state=777,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=777
)

# 50 shallow trees; bootstrap + oob_score give a built-in out-of-bag estimate
model = RandomForestClassifier(
    n_estimators=50, max_depth=3, random_state=777, bootstrap=True, oob_score=True
)
model.fit(X_train, y_train)

# Use class-1 probabilities for ROC-AUC; hard 0/1 predictions give a coarser score
y_prob = model.predict_proba(X_test)[:, 1]
rf_score = roc_auc_score(y_test, y_prob)
print(f"ROC-AUC (test) = {rf_score:.3f}")



Per-tree performance #

# ROC-AUC of the first 10 individual trees, compared with the full forest
estimator_scores = []
for est in model.estimators_[:10]:
    estimator_scores.append(roc_auc_score(y_test, est.predict_proba(X_test)[:, 1]))

plt.figure(figsize=(10, 4))
bar_index = list(range(len(estimator_scores)))
plt.bar(bar_index, estimator_scores)
plt.bar([len(bar_index)], [rf_score])
plt.xticks(bar_index + [len(bar_index)], [str(i) for i in bar_index] + ["RF"])
plt.xlabel("tree index")
plt.ylabel("ROC-AUC")
plt.show()

(figure: bar chart of ROC-AUC for the first 10 trees and the full forest)


Feature importance #

Impurity-based importance #

For each feature, sum the impurity decreases of the splits that use it, then average these totals over all trees; scikit-learn exposes the result as feature_importances_.
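As a quick sanity check of that averaging (a sketch relying on the model trained above; it assumes every tree made at least one split, since scikit-learn skips single-node trees when averaging):

# Average the per-tree importances by hand and compare with the forest attribute
manual_importances = np.mean(
    [tree.feature_importances_ for tree in model.estimators_], axis=0
)
print(np.allclose(manual_importances, model.feature_importances_))  # expected: True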

plt.figure(figsize=(10, 4))
feature_index = list(range(n_features))
plt.bar(feature_index, model.feature_importances_)
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()

(figure: bar chart of impurity-based importance by feature index)

Permutation importance #

Shuffle one feature at a time and measure how much the score drops; features whose shuffling hurts the score most are the most important.

from sklearn.inspection import permutation_importance

# Permutation importance on the training split (the test split is another common choice)
p_imp = permutation_importance(
    model, X_train, y_train, n_repeats=10, random_state=77
).importances_mean

plt.figure(figsize=(10, 4))
plt.bar(feature_index, p_imp)
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()

(figure: bar chart of permutation importance by feature index)


Visualize trees (optional) #

from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image, display

# Export the first 3 trees; requires the Graphviz "dot" command on PATH
for i in range(3):
    try:
        est = model.estimators_[i]
        export_graphviz(
            est,
            out_file=f"tree{i}.dot",
            feature_names=[f"x{j}" for j in range(n_features)],
            class_names=["A", "B"],
            proportion=True,
            filled=True,
        )
        call(["dot", "-Tpng", f"tree{i}.dot", "-o", f"tree{i}.png", "-Gdpi=500"])
        display(Image(filename=f"tree{i}.png"))
    except Exception as e:
        print(f"skipped tree {i}: {e}")

(figure: rendering of one exported tree)
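If Graphviz is not installed, a minimal alternative sketch using scikit-learn's built-in plot_tree (matplotlib only, reusing the model from above):

from sklearn.tree import plot_tree

# Draw the first tree directly with matplotlib; no external tools needed
plt.figure(figsize=(14, 6))
plot_tree(
    model.estimators_[0],
    feature_names=[f"x{j}" for j in range(n_features)],
    class_names=["A", "B"],
    filled=True,
    max_depth=3,
)
plt.show()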


Hyperparameter tips #

  • n_estimators: more trees → more stable predictions, but more compute.
  • max_depth: deeper trees overfit more easily; shallower trees may underfit.
  • max_features: fewer features per split → less correlated, more diverse trees.
  • bootstrap + oob_score: out-of-bag validation without a separate hold-out set (see the sketch below).
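A minimal sweep illustrating these trade-offs, reusing the data from above; the depth values are arbitrary choices and OOB accuracy serves as a cheap validation signal:

# Compare OOB accuracy for a few tree depths (n_estimators fixed at 200)
for depth in [2, 4, 8, None]:
    rf = RandomForestClassifier(
        n_estimators=200,
        max_depth=depth,
        bootstrap=True,
        oob_score=True,
        random_state=777,
    )
    rf.fit(X_train, y_train)
    print(f"max_depth={depth}: OOB accuracy = {rf.oob_score_:.3f}")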