Boruta | Selecting truly important features with Random Forest


Boruta is a feature selection algorithm that keeps only the features that genuinely matter for prediction. By removing irrelevant variables it can improve accuracy and interpretability while reducing computational cost.


1. Why feature selection? #

  • High-dimensional data
    Too many features increase noise and overfitting.

  • Computation cost
    Removing useless features speeds up training and prediction.

  • Interpretability
    It becomes clearer which features drive model decisions.


2. How Boruta works (intuition) #

  1. Create a shuffled copy of every feature; these “shadow features” carry no real signal and serve as a baseline.
  2. Train a Random Forest on the real features plus the shadow features.
  3. Compute the feature importances of both groups.
  4. A real feature that beats the best shadow feature scores a “hit”; features that accumulate significantly more hits than chance are marked important, those with significantly fewer are rejected.
  5. Repeat over many iterations until every feature is decided or a maximum number of iterations is reached (a hand-rolled sketch of one iteration follows this list).
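
To make the shadow-feature idea concrete, here is a minimal hand-rolled sketch of a single iteration on synthetic data. It is not the BorutaPy implementation, which repeats this comparison many times and adds a statistical test on the hit counts:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_redundant=0, random_state=0, shuffle=False)

# Shadow features: an independently shuffled copy of every real column
X_shadow = np.apply_along_axis(rng.permutation, 0, X)
X_both = np.hstack([X, X_shadow])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_both, y)
importances = rf.feature_importances_
real_imp = importances[:X.shape[1]]
shadow_imp = importances[X.shape[1]:]

# A real feature scores a “hit” if it beats the strongest shadow feature
hits = real_imp > shadow_imp.max()
print(hits)  # with shuffle=False the first 4 (informative) columns tend to be True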

3. Example (CSV data) #

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# Load data (BorutaPy works on plain numpy arrays, hence .values / .ravel())
X = pd.read_csv("examples/test_X.csv", index_col=0).values
y = pd.read_csv("examples/test_y.csv", header=None, index_col=0).values.ravel()

# Random Forest
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)

# Boruta
feat_selector = BorutaPy(rf, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X, y)

print("Selected features:", feat_selector.support_)
print("Feature ranking:", feat_selector.ranking_)

# Keep only important features
X_filtered = feat_selector.transform(X)
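
support_ is a boolean mask over the columns (and support_weak_ marks features Boruta left tentative), so if you also keep the CSV as a DataFrame, the mask can be mapped back to column names for interpretation. A small sketch reusing the fitted feat_selector above:

X_df = pd.read_csv("examples/test_X.csv", index_col=0)  # same data as X, but with column names

print("Confirmed features:", list(X_df.columns[feat_selector.support_]))
print("Tentative features:", list(X_df.columns[feat_selector.support_weak_]))

# Named equivalent of feat_selector.transform(X)
X_selected = X_df.loc[:, feat_selector.support_]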

4. Experiments with synthetic data #

We check whether Boruta keeps useful features and removes irrelevant ones.

All features are useful (no removal) #

from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=10, n_redundant=0, n_classes=2,
    random_state=0, shuffle=False
)
model = XGBClassifier(max_depth=4)

feat_selector = BorutaPy(model, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X, y)
X_filtered = feat_selector.transform(X)

print(f"{X.shape[1]} --> {X_filtered.shape[1]}")

Since all 10 features are informative, Boruta confirms every one of them and nothing is removed (the print shows 10 --> 10).


Many irrelevant features (remove them) #

X, y = make_classification(
    n_samples=2000, n_features=100,
    n_informative=10, n_redundant=0, n_classes=2,
    random_state=0, shuffle=False
)
model = XGBClassifier(max_depth=5)

feat_selector = BorutaPy(model, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X, y)
X_filtered = feat_selector.transform(X)

print(f"{X.shape[1]} --> {X_filtered.shape[1]}")

Only 10 of the 100 features are informative; Boruta keeps those 10 and rejects the remaining noise features (100 --> 10).
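
A quick way to check that dropping the noise columns helps (or at least does not hurt) is to cross-validate the same model before and after selection. A sketch continuing from the variables above; the exact scores depend on the synthetic data:

from sklearn.model_selection import cross_val_score

score_all = cross_val_score(XGBClassifier(max_depth=5), X, y, cv=5).mean()
score_sel = cross_val_score(XGBClassifier(max_depth=5), X_filtered, y, cv=5).mean()
print(f"all 100 features: {score_all:.3f} | selected features: {score_sel:.3f}")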


5. Practical notes #

  • Works especially well with tree-based models (Random Forest, XGBoost).
  • Removes noisy features reliably.
  • Computationally heavier when the feature count is large, since the shadow copies double the width and the forest is refit at every iteration; the perc and max_iter arguments sketched below help bound the cost.
  • The selected features can be used for interpretation and visualisation.
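
As a rough illustration of the cost knobs mentioned above, BorutaPy exposes max_iter (caps the number of iterations) and perc (the shadow-importance percentile a feature must beat; 100 is the strictest). A sketch reusing the rf estimator from section 3, with X and y being whichever dataset you are selecting on; suitable values depend on the data:

feat_selector = BorutaPy(
    rf,
    n_estimators="auto",
    perc=90,       # compare against the 90th percentile of shadow importances (slightly less strict)
    max_iter=50,   # stop after at most 50 iterations to bound runtime
    random_state=1,
    verbose=0,
)
feat_selector.fit(X, y)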

Summary #

  • Boruta compares real features to “shadow” features to select stable, useful variables.
  • When every feature is useful it keeps them all; otherwise it discards the irrelevant ones.
  • As a preprocessing step, it improves accuracy, efficiency, and interpretability.