Boruta is a feature selection algorithm that keeps only the features that genuinely help a model. By removing irrelevant variables it can improve accuracy and interpretability and reduce computational cost.
1. Why feature selection? #
- High-dimensional data: too many features increase noise and the risk of overfitting.
- Computational cost: removing useless features speeds up training and prediction.
- Interpretability: it becomes clearer which features drive the model's decisions.
2. How Boruta works (intuition) #
- Train a Random Forest with all features.
- Compute feature importance.
- Create “shadow features”: shuffled copies of the real features that serve as a pure-noise baseline.
- If a real feature consistently beats the best shadow feature, mark it important; otherwise reject it.
- Repeat to stabilise the selection (a minimal one-iteration sketch follows this list).
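As a rough illustration, here is a minimal sketch of a single Boruta-style iteration. This is not BorutaPy's actual implementation; the function name and parameters are purely illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
def boruta_iteration(X, y, seed=0):
    # Shadow features: a shuffled copy of every real column. Shuffling destroys
    # any relationship with y, so shadows act as a pure-noise baseline.
    rng = np.random.default_rng(seed)
    shadow_X = np.apply_along_axis(rng.permutation, 0, X)
    # Train on real + shadow features together and read the importances.
    rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=seed)
    rf.fit(np.hstack([X, shadow_X]), y)
    n = X.shape[1]
    real_imp = rf.feature_importances_[:n]
    shadow_imp = rf.feature_importances_[n:]
    # A real feature scores a "hit" when it beats the best shadow feature;
    # Boruta repeats this many times and keeps features with enough hits.
    return real_imp > shadow_imp.max()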
3. Example (CSV data) #
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
# Load data
X = pd.read_csv("examples/test_X.csv", index_col=0).values
y = pd.read_csv("examples/test_y.csv", header=None, index_col=0).values.ravel()
# Random Forest
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5)
# Boruta
feat_selector = BorutaPy(rf, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X, y)
print("Selected features:", feat_selector.support_)
print("Feature ranking:", feat_selector.ranking_)
# Keep only important features
X_filtered = feat_selector.transform(X)
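Since support_ is a boolean mask over the feature columns, reading the same CSV as a DataFrame lets you recover the names of the selected columns (a small sketch, assuming the same file as above):
# Recover the selected column names from the original DataFrame
# (assumes the same CSV as above).
X_df = pd.read_csv("examples/test_X.csv", index_col=0)
print("Selected columns:", list(X_df.columns[feat_selector.support_]))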
4. Experiments with synthetic data #
We check whether Boruta keeps useful features and removes irrelevant ones.
All features are useful (no removal) #
from sklearn.datasets import make_classification
from xgboost import XGBClassifier
# Synthetic data where all 10 features are informative
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_informative=10, n_redundant=0, n_classes=2,
    random_state=0, shuffle=False,
)
model = XGBClassifier(max_depth=4)
feat_selector = BorutaPy(model, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X, y)
X_filtered = feat_selector.transform(X)
print(f"{X.shape[1]} --> {X_filtered.shape[1]}")
If all features are useful, none are removed.
Many irrelevant features (remove them) #
# Synthetic data: only 10 of the 100 features are informative
X, y = make_classification(
    n_samples=2000, n_features=100,
    n_informative=10, n_redundant=0, n_classes=2,
    random_state=0, shuffle=False,
)
model = XGBClassifier(max_depth=5)
feat_selector = BorutaPy(model, n_estimators="auto", verbose=2, random_state=1)
feat_selector.fit(X, y)
X_filtered = feat_selector.transform(X)
print(f"{X.shape[1]} --> {X_filtered.shape[1]}")
10 useful out of 100 → Boruta keeps the 10 and removes the rest.
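Because make_classification was called with shuffle=False, the informative features occupy the first 10 columns, so the selection can be sanity-checked against the known indices (output not shown here):
# With shuffle=False the informative features are columns 0-9,
# so the kept indices should be (roughly) the first ten.
print("Kept feature indices:", np.where(feat_selector.support_)[0])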
5. Practical notes #
- Requires an estimator that exposes feature_importances_, so it works especially well with tree-based models (Random Forest, XGBoost).
- Removes noisy features reliably.
- Computationally heavier when the feature count is large (see the tuning sketch after this list).
- The selected features can be used for interpretation and visualisation.
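When the feature count is large, run time can be reined in through BorutaPy's own parameters; the values below are only illustrative, not recommendations for any particular dataset.
# Faster, more permissive settings for wide data (illustrative values only):
# - max_iter caps the number of Boruta iterations,
# - perc < 100 lowers the shadow-importance threshold (less conservative),
# - alpha is the significance level of the accept/reject test.
feat_selector = BorutaPy(
    rf, n_estimators="auto",
    perc=90, alpha=0.05, max_iter=50,
    random_state=1, verbose=0,
)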
Summary #
- Boruta compares real features to “shadow” features to select stable, useful variables.
- It keeps everything if all features are useful, and removes the rest when needed.
- As a preprocessing step, it can improve accuracy, efficiency, and interpretability.