Random Forest trains many decision trees using bootstrap samples and feature subsampling. It predicts by majority vote (classification) or averaging (regression), reducing variance and improving robustness.
How it works (formulas) #
- Train a tree $h_b(x)$ on each bootstrap sample $\mathcal{D}_b$, for $b=1,\dots,B$.
- Prediction:
  - Classification: $\hat y = \operatorname*{arg\,max}_c \sum_{b=1}^B \mathbf{1}[h_b(x)=c]$
  - Regression: $\hat y = \tfrac{1}{B}\sum_{b=1}^B h_b(x)$
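The two prediction rules above can be sketched with plain NumPy. The per-tree outputs here are made-up numbers for illustration, not results from a fitted model. Note that scikit-learn's `RandomForestClassifier` averages per-tree class probabilities instead of counting hard votes, so this shows the textbook rule rather than the library's exact procedure.

```python
import numpy as np

# Hypothetical hard predictions h_b(x) from B = 5 trees for 4 samples
# (rows = trees, columns = samples).
tree_votes = np.array([
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 0],
])

n_classes = 2
# Count the votes each class receives per sample, then pick the most voted class.
vote_counts = np.stack([(tree_votes == c).sum(axis=0) for c in range(n_classes)])
majority = vote_counts.argmax(axis=0)

# Hypothetical regression outputs from 5 trees for 2 samples.
tree_outputs = np.array([[1.0, 4.0], [2.0, 6.0], [3.0, 5.0], [2.0, 5.0], [2.0, 5.0]])
average = tree_outputs.mean(axis=0)

print(majority)  # [0 1 1 0]
print(average)   # [2. 5.]
```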
Split criterion example (Gini): $\mathrm{Gini}(S)=1-\sum_c p(c\mid S)^2$
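As a small worked example of the Gini formula, the sketch below computes the impurity of a node and the impurity decrease of one candidate split. The helper names `gini` and `split_gain` are hypothetical, and the size-weighted decrease is the usual CART-style criterion, assumed here rather than taken from the text.

```python
import numpy as np

def gini(labels):
    """Gini impurity 1 - sum_c p(c|S)^2 for a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(parent, left, right):
    """Impurity decrease: parent Gini minus the size-weighted Gini of the children."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]

parent_gini = gini(parent)             # balanced two-class node: 1 - (0.25 + 0.25) = 0.5
gain = split_gain(parent, left, right)  # both children are pure, so the gain is 0.5
print(parent_gini, gain)
```

A split that separates the classes perfectly removes all impurity, which is why the gain equals the parent's Gini value here.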
Train on synthetic data and check ROC-AUC #
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
n_features = 20
X, y = make_classification(
    n_samples=2500,
    n_features=n_features,
    n_informative=10,
    n_classes=2,
    n_redundant=0,
    n_clusters_per_class=4,
    random_state=777,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=777
)
model = RandomForestClassifier(
    n_estimators=50, max_depth=3, random_state=777, bootstrap=True, oob_score=True
)
model.fit(X_train, y_train)
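# ROC-AUC ranks samples by score, so pass the positive-class probability
# rather than hard 0/1 labels
y_score = model.predict_proba(X_test)[:, 1]
rf_score = roc_auc_score(y_test, y_score)
print(f"ROC-AUC (test) = {rf_score:.3f}")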
Per-tree performance #
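# Score the first 10 trees individually on the test set
estimator_scores = []
for i in range(10):
    est = model.estimators_[i]
    # Use each tree's positive-class probability as the ranking score
    estimator_scores.append(roc_auc_score(y_test, est.predict_proba(X_test)[:, 1]))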
plt.figure(figsize=(10, 4))
bar_index = [i for i in range(len(estimator_scores))]
plt.bar(bar_index, estimator_scores)
plt.bar([10], rf_score)
plt.xticks(bar_index + [10], bar_index + ["RF"])
plt.xlabel("tree index")
plt.ylabel("ROC-AUC")
plt.show()
Feature importance #
Impurity-based importance #
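For each feature, sum the impurity decrease over every split that uses it, then average across trees. scikit-learn exposes the result as `feature_importances_`, normalized to sum to 1.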
plt.figure(figsize=(10, 4))
feature_index = [i for i in range(n_features)]
plt.bar(feature_index, model.feature_importances_)
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()
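To make the "average over trees" step concrete, the sketch below rebuilds the forest-level importances from the individual trees and compares them with `feature_importances_`. It fits a small separate model on its own synthetic data (the name `forest` and all sizes are arbitrary choices for illustration), and it assumes every tree has at least one split, which should hold on this data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, n_redundant=0, random_state=0
)
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0).fit(X, y)

# Each fitted tree exposes its own normalized impurity-based importances.
per_tree = np.array([tree.feature_importances_ for tree in forest.estimators_])

# Average across trees, then renormalize so the values sum to 1.
manual = per_tree.mean(axis=0)
manual = manual / manual.sum()

matches = np.allclose(manual, forest.feature_importances_)
total = forest.feature_importances_.sum()
print(matches, total)
```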
Permutation importance #
from sklearn.inspection import permutation_importance
p_imp = permutation_importance(
    model, X_train, y_train, n_repeats=10, random_state=77
).importances_mean
plt.figure(figsize=(10, 4))
plt.bar(feature_index, p_imp)
plt.xlabel("feature index")
plt.ylabel("importance")
plt.show()
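The idea behind permutation importance can be sketched by hand: shuffle one column, re-score the model, and record how much the score drops. This is a simplified illustration, not scikit-learn's implementation. It uses accuracy on a held-out split (the call above scores on the training data), and `shuffle=False` is set only so that the informative features are known to sit in columns 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 2 informative features in the first two columns;
# the remaining 3 columns are pure noise.
X, y = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
baseline = forest.score(X_te, y_te)  # accuracy on the untouched held-out data

manual_imp = []
for j in range(X_te.shape[1]):
    drops = []
    for _ in range(5):  # repeat the shuffle to average out randomness
        X_perm = X_te.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
        drops.append(baseline - forest.score(X_perm, y_te))
    manual_imp.append(np.mean(drops))
manual_imp = np.array(manual_imp)

print(manual_imp.round(3))
```

Shuffling a noise column should leave the score almost unchanged, while shuffling an informative column should lower it noticeably.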
Visualize trees (optional) #
from sklearn.tree import export_graphviz
from subprocess import call
from IPython.display import Image, display
for i in range(3):
    try:
        est = model.estimators_[i]
        export_graphviz(
            est,
            out_file=f"tree{i}.dot",
            feature_names=[f"x{j}" for j in range(n_features)],
            class_names=["A", "B"],
            proportion=True,
            filled=True,
        )
        # Requires the Graphviz `dot` executable on PATH
        call(["dot", "-Tpng", f"tree{i}.dot", "-o", f"tree{i}.png", "-Gdpi=500"])
        display(Image(filename=f"tree{i}.png"))
    except Exception as exc:
        # Rendering is optional, so report the failure and move on
        print(f"Could not render tree {i}: {exc}")
Hyperparameter tips #
- n_estimators: more trees → more stable, more compute.
- max_depth: deeper → overfit; shallower → underfit.
- max_features: fewer → lower correlation, more diversity.
- bootstrap, oob_score: optional out-of-bag validation.
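As an illustration of the last tip, the sketch below fits a forest with `oob_score=True` and reads the out-of-bag accuracy from `oob_score_`, next to the accuracy on a held-out split. The dataset sizes and the name `forest` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)

# oob_score=True requires bootstrap=True: each tree is evaluated on the
# training samples left out of its bootstrap sample.
forest = RandomForestClassifier(
    n_estimators=100, bootstrap=True, oob_score=True, random_state=0
)
forest.fit(X_tr, y_tr)

oob_acc = forest.oob_score_          # accuracy estimated from out-of-bag samples
test_acc = forest.score(X_te, y_te)  # accuracy on the held-out split
print(f"OOB accuracy  = {oob_acc:.3f}")
print(f"test accuracy = {test_acc:.3f}")
```

The out-of-bag estimate needs no separate validation split, which can help when data is scarce.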