Feature selection keeps only the truly useful features from a large set. Removing irrelevant features helps prevent overfitting, speeds up training, and improves interpretability.
1. Why feature selection? #
- Avoid overfitting: noisy features can hurt model accuracy.
- Improve efficiency: fewer features mean faster training and inference.
- Better interpretability: it becomes clearer what the model relies on.
2. Three major approaches #
2.1 Filter methods #
- Rank features using statistical criteria.
- Lightweight because no model training is required.
Examples:
- Correlation
- Chi-square test (χ²)
- Mutual information
from sklearn.feature_selection import SelectKBest, chi2
# X: non-negative feature matrix (chi2 requires non-negative values), y: class labels
X_new = SelectKBest(score_func=chi2, k=5).fit_transform(X, y)
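Mutual information, also listed above, drops the non-negativity requirement of χ² and can capture non-linear dependence. A minimal sketch, assuming the same X and y as above (X_mi is just an illustrative name):
from sklearn.feature_selection import SelectKBest, mutual_info_classif
# Mutual information works with any feature values and captures non-linear dependence
X_mi = SelectKBest(score_func=mutual_info_classif, k=5).fit_transform(X, y)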
2.2 Wrapper methods #
- Train a model on candidate feature subsets and select features based on measured performance.
- Common examples:
- Sequential Forward Selection (SFS)
- Sequential Backward Selection (SBS)
⚠️ Note
- SFS/SBS involve repeated training and human decisions, which can introduce bias and overfitting.
- They also become expensive on large datasets.
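For illustration, scikit-learn provides SequentialFeatureSelector, which implements forward and backward selection with cross-validation. A minimal sketch, assuming X and y as above and a logistic regression as the base model (both arbitrary choices for this example):
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward selection: greedily add the feature that most improves the CV score
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,
    direction="forward",  # "backward" would give SBS instead
    cv=5,
)
X_sfs = sfs.fit_transform(X, y)
Cross-validation inside the selector reduces, but does not remove, the overfitting risk noted above, and the repeated model fits remain costly on large datasets.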
2.3 Embedded methods #
- Use feature importance obtained during model training.
- Efficient because selection happens in a single run.
Examples:
- L1 regularisation (Lasso): shrinks unnecessary coefficients to exactly zero, removing those features from the model.
- Tree-based models (Random Forest, XGBoost): use feature_importances_ to rank and select features.
from sklearn.linear_model import Lasso
# Fit an L1-regularised linear model; a larger alpha zeroes out more coefficients
model = Lasso(alpha=0.1).fit(X, y)
# Boolean mask of the features that kept a non-zero coefficient
selected = model.coef_ != 0
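For the tree-based route, a minimal sketch that keeps the highest-importance features, assuming X is a NumPy array and y the labels (keeping 5 features is an arbitrary choice here):
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Fit a forest and rank features by impurity-based importance
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Indices of the 5 most important features (5 chosen only for illustration)
top5 = np.argsort(forest.feature_importances_)[::-1][:5]
X_top = X[:, top5]  # assumes X is a NumPy array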
3. Practical notes #
- Fit feature selection on the training data only; running it before the train/test split leaks information from the test set (see the pipeline sketch after this list).
- For high-dimensional data (e.g., gene expression or text data), feature selection is essential.
- Think about interpretability, not just accuracy.
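To make the leakage point concrete, one common pattern (a sketch, not prescribed above) is to place the selector inside a Pipeline so it is fit on the training split only:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# The selector learns its scores from the training data only; the test split is
# merely transformed with those learned parameters, so nothing leaks back
pipe = Pipeline([
    ("select", SelectKBest(score_func=mutual_info_classif, k=5)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))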
Summary #
- Feature selection improves accuracy, efficiency, and interpretability.
- Methods fall into filter, wrapper, and embedded categories.
- SFS/SBS are intuitive but risk human bias and overfitting.
- In practice, embedded methods (Lasso, tree models) are most common.