Random Forest | Bagging Decision Trees for Robust Predictions

Created: 2019-02-02 Last updated: 2020-01-29 Read time: 2 min

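Random Forest trains many decision trees, each on a bootstrap sample of the training data and with a random subset of features considered at each split. It predicts by majority vote (classification) or averaging (regression); combining many decorrelated trees reduces variance and makes predictions more robust than those of a single tree.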

How it works (formulas) #

Train a tree \(h_b(x)\) on each bootstrap sample \(\mathcal{D}_b\), for \(b=1,\dots,B\).
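As a rough illustration of what a bootstrap sample \(\mathcal{D}_b\) looks like, the sketch below draws row indices with replacement using only the standard library. The dataset size n and the seed are arbitrary choices for the demo, not values from this article.

```python
import random

n = 1000  # assumed dataset size, chosen only for illustration
rng = random.Random(0)  # fixed seed so the run is repeatable

# A bootstrap sample draws n row indices with replacement
bootstrap_indices = [rng.randrange(n) for _ in range(n)]

in_bag = set(bootstrap_indices)
out_of_bag = set(range(n)) - in_bag  # rows this tree never sees

unique_fraction = len(in_bag) / n
print(f"unique rows in the sample: {unique_fraction:.1%}")
print(f"out-of-bag rows: {len(out_of_bag)}")
```

On average about 63% of the rows (1 - 1/e) appear in a given bootstrap sample; the remaining rows are out-of-bag for that tree.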
Prediction:
- Classification: \(\hat y = \operatorname*{arg\,max}_c \sum_{b=1}^B \mathbf{1}[h_b(x)=c]\)
- Regression: \(\hat y = \tfrac{1}{B}\sum_{b=1}^B h_b(x)\)
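The two aggregation rules above can be sketched in a few lines of plain Python. The per-tree predictions here are made-up values, and the helper names majority_vote and average_prediction are hypothetical, introduced only for this example.

```python
from collections import Counter
from statistics import mean

def majority_vote(tree_predictions):
    """Classification: return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def average_prediction(tree_predictions):
    """Regression: return the mean of the per-tree predictions."""
    return mean(tree_predictions)

# Hypothetical outputs of B = 5 trees for one input x
votes = ["A", "B", "A", "A", "B"]
print(majority_vote(votes))        # A

# Hypothetical outputs of B = 3 regression trees for one input x
values = [2.0, 3.0, 4.0]
print(average_prediction(values))  # 3.0
```

Note that scikit-learn's RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, so its result can differ from a strict majority vote when trees are uncertain.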

Split criterion example (Gini): \(\mathrm{Gini}(S)=1-\sum_c p(c\mid S)^2\)
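A small worked example may help make the Gini formula concrete. The sketch below computes the impurity of a label set and the impurity decrease of one candidate split; the function names and the toy labels are assumptions made for this illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum over classes of p(c|S)^2."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children

parent = ["A", "A", "B", "B"]
print(gini(parent))                                       # 0.5
print(gini(["A", "A", "A", "A"]))                         # 0.0
print(impurity_decrease(parent, ["A", "A"], ["B", "B"]))  # 0.5
```

A perfectly mixed two-class node has impurity 0.5, a pure node has impurity 0, and a split that separates the classes completely recovers the full 0.5.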

Train on synthetic data and check ROC-AUC #

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

n_features = 20
X, y = make_classification(
    n_samples=2500,
    n_features=n_features,
    n_informative=10,
    n_classes=2,
    n_redundant=0,
    n_clusters_per_class=4,
    random_state=777,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=777
)

model = RandomForestClassifier(
    n_estimators=50, max_depth=3, random_state=777, bootstrap=True, oob_score=True
)
model.fit(X_train, y_train)
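# ROC-AUC ranks samples by score, so pass the positive-class probability, not hard 0/1 labels
y_score = model.predict_proba(X_test)[:, 1]
rf_score = roc_auc_score(y_test, y_score)
print(f"ROC-AUC (test) = {rf_score:.3f}")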
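The model above is created with oob_score=True, so it may be worth showing how the out-of-bag estimate can be read back after fitting. The following is a self-contained sketch on a smaller synthetic dataset; the names X_demo, y_demo, and rf, and the dataset size, are assumptions for this example rather than part of the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Small synthetic dataset, chosen only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=10, n_redundant=0, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=50, max_depth=3, bootstrap=True, oob_score=True, random_state=0
)
rf.fit(X_demo, y_demo)

# oob_score_ is the accuracy on samples each tree did not see during training
oob_accuracy = rf.oob_score_

# oob_decision_function_ holds out-of-bag class probabilities per training sample
oob_auc = roc_auc_score(y_demo, rf.oob_decision_function_[:, 1])

print(f"OOB accuracy = {oob_accuracy:.3f}")
print(f"OOB ROC-AUC  = {oob_auc:.3f}")
```

Because each sample is scored only by trees that never trained on it, these numbers behave like a validation score without requiring a separate split.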


Per-tree performance #

# Score the first 10 trees and the full forest the same way (positive-class probability)
estimator_scores = [
    roc_auc_score(y_test, est.predict_proba(X_test)[:, 1])
    for est in model.estimators_[:10]
]
forest_score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

plt.figure(figsize=(10, 4))
bar_index = list(range(len(estimator_scores)))
plt.bar(bar_index, estimator_scores)  # individual trees
plt.bar([len(bar_index)], [forest_score])  # whole forest
plt.xticks(bar_index + [len(bar_index)], bar_index + ["RF"])
plt.xlabel("tree index")
plt.ylabel("ROC-AUC")
plt.show()

Figure: ROC-AUC of the first 10 trees compared with the full forest (RF).

Feature importance #

Impurity-based importance #

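Impurity-based importance sums, for each feature, the impurity decrease at every split that uses that feature, then averages the result over all trees. scikit-learn exposes it as feature_importances_, normalized so that the values sum to 1.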

plt.figure(figsize=(10, 4))
feature_index = [i for i in range(n_features)]
plt.bar(feature_index, model.feature_importances_)
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()
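To connect the description of impurity-based importance with the attribute plotted above, the sketch below rebuilds the forest-level values from the individual trees. It reflects my understanding of scikit-learn's behavior (mean of per-tree importances, then normalization); the dataset and the names X_demo, y_demo, rf, per_tree, and manual are assumptions for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset, chosen only for illustration
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
rf = RandomForestClassifier(n_estimators=25, max_depth=3, random_state=0)
rf.fit(X_demo, y_demo)

# Each fitted tree exposes its own normalized impurity-based importances
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])

# Average over trees, then renormalize so the values sum to 1
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

print(np.allclose(manual, rf.feature_importances_))
print(per_tree.std(axis=0))  # spread of each feature's importance across trees
```

The per-tree spread is a useful sanity check: a feature whose importance varies widely between trees deserves less confidence than its mean alone suggests.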

Figure: impurity-based importance for each feature index.

Permutation importance #

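Permutation importance measures how much the model's score drops when the values of a single feature are randomly shuffled. scikit-learn implements it as permutation_importance; the example below evaluates it on the training data.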

from sklearn.inspection import permutation_importance

p_imp = permutation_importance(
    model, X_train, y_train, n_repeats=10, random_state=77
).importances_mean

plt.figure(figsize=(10, 4))
plt.bar(feature_index, p_imp)
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()
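To show the idea behind permutation importance without the library call, here is a minimal NumPy sketch: shuffle one column, re-score, and record the drop in accuracy. The function permutation_importance_sketch, the toy data, and the rule-based toy_predict model are all hypothetical and exist only for this illustration; this is not scikit-learn's implementation.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean(predict(X) == y)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its relationship with the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops.append(baseline - np.mean(predict(X_perm) == y))
        importances[j] = np.mean(drops)
    return importances

# Toy data: only column 0 determines the label
data_rng = np.random.default_rng(0)
X_toy = data_rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

def toy_predict(X):
    """Hypothetical 'model' that looks only at column 0."""
    return (X[:, 0] > 0).astype(int)

imp = permutation_importance_sketch(toy_predict, X_toy, y_toy)
print(imp)
```

Shuffling the one informative column costs roughly half of the accuracy, while shuffling columns the model ignores costs nothing.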

Figure: permutation importance for each feature index.

Visualize trees (optional) #

from subprocess import run
from sklearn.tree import export_graphviz
from IPython.display import Image, display

feature_names = [f"x{j}" for j in range(n_features)]

# Render the first three trees; requires the Graphviz "dot" command on PATH
for i in range(3):
    try:
        export_graphviz(
            model.estimators_[i],
            out_file=f"tree{i}.dot",
            feature_names=feature_names,
            class_names=["A", "B"],
            proportion=True,
            filled=True,
        )
        run(["dot", "-Tpng", f"tree{i}.dot", "-o", f"tree{i}.png", "-Gdpi=500"], check=True)
        display(Image(filename=f"tree{i}.png"))
    except Exception as exc:
        # Report the failure instead of silently skipping the tree
        print(f"Could not render tree {i}: {exc}")
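If Graphviz is not installed, a text rendering can be a lighter alternative. The sketch below uses export_text from sklearn.tree on a small forest; the dataset and the names X_demo, y_demo, rf, names, and tree_text are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Small synthetic dataset, chosen only for illustration
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
rf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
rf.fit(X_demo, y_demo)

names = [f"x{j}" for j in range(X_demo.shape[1])]

# Print the split rules of the first tree as indented text
tree_text = export_text(rf.estimators_[0], feature_names=names)
print(tree_text)
```

The output lists each split threshold and leaf class, which is often enough to inspect shallow trees without producing image files.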

Figure: Graphviz renderings of the first three trees.

Hyperparameter tips #

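n_estimators: more trees give more stable predictions at higher training and inference cost; gains level off once the forest is large enough.
max_depth: deeper trees fit the training data more closely and risk overfitting; shallower trees risk underfitting.
max_features: fewer candidate features per split lowers the correlation between trees and increases diversity, at the cost of weaker individual trees.
bootstrap, oob_score: with bootstrap=True, setting oob_score=True scores each training sample using only the trees that did not see it, giving a validation estimate without a separate hold-out set.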
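One way to see the max_depth trade-off is to compare training and test accuracy at a few depths. The sketch below does this on a small synthetic dataset; the depth values, the dataset size, and the names X_demo, y_demo, and results are assumptions for this illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small synthetic dataset, chosen only for illustration
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=10, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=0
)

results = {}
for depth in (2, 5, None):  # None lets each tree grow until its leaves are pure
    rf = RandomForestClassifier(n_estimators=50, max_depth=depth, random_state=0)
    rf.fit(X_tr, y_tr)
    results[depth] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))
    train_acc, test_acc = results[depth]
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A widening gap between training and test accuracy as depth grows is the usual sign that the trees are starting to overfit.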