Feature Selection Basics

Why it matters and common methods

Feature selection keeps only the truly useful features from a large set. Removing irrelevant features helps prevent overfitting, speeds up training, and improves interpretability.


1. Why feature selection?

  • Avoid overfitting: irrelevant or noisy features can degrade generalisation.
  • Improve efficiency: fewer features mean faster training and inference.
  • Better interpretability: it becomes clearer which features the model relies on.

2. Three major approaches

2.1 Filter methods

  • Rank features using statistical criteria.
  • Lightweight because no model training is required.

Examples:

  • Correlation with the target
  • Chi-square test (χ²)
  • Mutual information

from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values (e.g. counts); keep the 5 best
X_new = SelectKBest(score_func=chi2, k=5).fit_transform(X, y)

2.2 Wrapper methods

  • Train models and select features based on performance.
  • Common examples (see the sketch after the note below):
    • Sequential Forward Selection (SFS)
    • Sequential Backward Selection (SBS)

⚠️ Note

  • SFS/SBS involve repeated model training, and their greedy, manually tuned stopping criteria can introduce selection bias and overfitting.
  • They also become expensive on large datasets, since each step retrains the model for every candidate feature.
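
A minimal sketch of SFS/SBS using scikit-learn's SequentialFeatureSelector; the LogisticRegression estimator and n_features_to_select=5 are illustrative choices, and X, y are assumed to be defined as in the other snippets:

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Forward selection: greedily add the feature that most improves the CV score
sfs = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),  # illustrative base estimator
    n_features_to_select=5,
    direction="forward",  # "backward" gives SBS
)
X_new = sfs.fit_transform(X, y)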

2.3 Embedded methods

  • Use feature importance obtained during model training.
  • Efficient because selection happens in a single run.

Examples:

  • L1 regularisation (Lasso)
    Shrinks uninformative coefficients to exactly zero.
  • Tree-based models (Random Forest, XGBoost)
    Expose feature_importances_ (see the second sketch below).

from sklearn.linear_model import Lasso

# The L1 penalty drives some coefficients exactly to zero
model = Lasso(alpha=0.1).fit(X, y)
selected = model.coef_ != 0  # boolean mask of the kept features
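
A sketch of the tree-based variant, again assuming X, y are defined; the RandomForestClassifier settings and the "median" threshold are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Keep features whose importance exceeds the median importance
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
X_new = SelectFromModel(rf, threshold="median", prefit=True).transform(X)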

3. Practical notes

  • Fit feature selection on the training data only: running it on the full dataset before the train/test split leaks information (see the pipeline sketch after this list).
  • For high-dimensional data (e.g., gene or text data), feature selection is essential.
  • Think about interpretability, not just accuracy.
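
A leakage-safe sketch using a scikit-learn Pipeline, so selection is refit inside each cross-validation fold; the mutual-information selector, k=10, and LogisticRegression are illustrative:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The selector sees only each fold's training split, never its test split
pipe = Pipeline([
    ("select", SelectKBest(mutual_info_classif, k=10)),
    ("clf", LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5)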

Summary

  • Feature selection improves accuracy, efficiency, and interpretability.
  • Methods fall into filter, wrapper, and embedded categories.
  • SFS/SBS are intuitive but computationally costly and prone to selection bias and overfitting.
  • In practice, embedded methods (Lasso, tree models) are most common.