Binning numerical features

Binning (discretisation) converts a continuous feature into ordered categories. It is useful when a model cannot handle real-valued inputs directly or when you wish to build features such as an “income decile”.

Equal-width vs. equal-frequency

Let \(x_1, \dots, x_n\) be the observed values of a feature. A binning rule partitions the range of the feature into intervals \(I_k\) and replaces each \(x_i\) with the label of the interval that contains it.

  • Equal-width binning divides the range into intervals of the same length.
  • Equal-frequency (quantile) binning divides the sorted data so that each bin has approximately the same number of observations (pandas.qcut).

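Equal-frequency bins are more robust to heavy tails because their edges follow the data, whereas equal-width bins keep every interval the same length, so the bin index still reflects distance on the original scale.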
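The two rules can be sketched directly with NumPy to make the edge computation explicit. This is a minimal illustration on made-up toy values; the names `values`, `width_edges` and `quantile_edges` are assumptions for the example, and `np.digitize` uses left-closed intervals, whereas `pandas.cut` and `pandas.qcut` default to right-closed ones.

```python
import numpy as np

# Toy feature with one large outlier (illustrative values, not from the article).
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], dtype=float)
n_bins = 4

# Equal-width: edges are evenly spaced between the minimum and the maximum.
width_edges = np.linspace(values.min(), values.max(), n_bins + 1)

# Equal-frequency: edges are the 0%, 25%, 50%, 75% and 100% quantiles.
quantile_edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))

# Replace each value with the index of its interval (0 .. n_bins - 1).
# Only the interior edges are passed, so the minimum and maximum
# fall into the first and last bin respectively.
width_labels = np.digitize(values, width_edges[1:-1])
quantile_labels = np.digitize(values, quantile_edges[1:-1])

print(np.bincount(width_labels, minlength=n_bins))     # [9 0 0 1]
print(np.bincount(quantile_labels, minlength=n_bins))  # [3 2 2 3]
```

The single outlier pushes nine of the ten values into the first equal-width bin, while the quantile edges keep the counts balanced.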

Visualising quantile bins

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

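# Simulate a right-skewed income feature (log-normal) with a fixed seed.
rng = np.random.default_rng(18)
income = rng.lognormal(mean=10, sigma=0.55, size=1_000)
# Six edges at the 0%, 20%, ..., 100% quantiles define five equal-frequency bins.
quantile_edges = np.quantile(income, np.linspace(0, 1, 6))

plt.hist(income, bins=40, color="steelblue", alpha=0.85)
# Draw only the four interior edges; the outer two are the minimum and maximum.
for edge in quantile_edges[1:-1]:
    plt.axvline(edge, color="darkorange", linestyle="--", alpha=0.8)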
plt.title("Quantile bin edges (equal-frequency, 5 bins)")
plt.xlabel("Income")
plt.ylabel("Count")
plt.show()

Figure: histogram of income with the four interior quantile bin edges drawn as dashed lines.

Comparing qcut and cut

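# Continues from the previous snippet: reuses `income`, `np`, `pd` and `plt`.
quantile_bins = pd.qcut(income, q=5)   # five equal-frequency bins
width_bins = pd.cut(income, bins=5)    # five equal-width bins

# sort=False keeps the counts in bin order rather than sorting by frequency.
quantile_counts = pd.Series(quantile_bins).value_counts(sort=False)
width_counts = pd.Series(width_bins).value_counts(sort=False)
indices = np.arange(len(quantile_counts))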

fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(indices - 0.2, quantile_counts.values, width=0.4, color="seagreen", alpha=0.8, label="qcut (equal-frequency)")
ax.bar(indices + 0.2, width_counts.values, width=0.4, color="firebrick", alpha=0.6, label="cut (equal-width)")
ax.set_xticks(indices)
ax.set_xticklabels([f"Bin {i}" for i in indices])
ax.set_ylabel("Number of samples")
ax.legend()
ax.set_title("Distribution of samples per bin")
plt.tight_layout()
plt.show()

Figure: number of samples per bin for qcut (equal-frequency) and cut (equal-width).

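Equal-frequency binning produces nearly identical counts for each bin (about 200 of the 1,000 samples here). Equal-width binning, applied to this right-skewed feature, puts most observations into the lowest bins, where the data are dense, and leaves the upper bins, which cover the long right tail, almost empty.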

Practical tips

  • Clip extreme outliers before binning; even a single extreme value can stretch the range so far that most equal-width bins end up nearly empty.
  • Store the edges computed on the training data and reuse them at prediction time so that every dataset is binned with identical definitions.
  • Tree-based models rarely need explicit binning because they already split on thresholds. For linear models, however, one-hot encoded bins can capture non-linear effects with only a handful of extra columns.
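The tip about storing edges can be sketched as follows. This is one possible approach, not the only one: it assumes the edges are learned with `pd.qcut(..., retbins=True)` and applied with `pd.cut`, and the arrays `train` and `new` are made-up data for illustration. Replacing the outer edges with infinities is a design choice that sends unseen extremes to the first or last bin; clipping new values to the training range would be an alternative.

```python
import numpy as np
import pandas as pd

# Hypothetical training data and a few new values, two of them
# far outside the training range.
rng = np.random.default_rng(0)
train = rng.lognormal(mean=10, sigma=0.55, size=1_000)
new = np.array([500.0, 22_000.0, 5_000_000.0])

# Learn the quantile edges once, on the training data only.
_, edges = pd.qcut(train, q=5, retbins=True)

# Open the outer bins so unseen extremes land in the first or last bin
# instead of becoming NaN.
edges = edges.copy()
edges[0], edges[-1] = -np.inf, np.inf

# Reuse the stored edges; labels=False returns integer bin codes 0..4.
train_codes = pd.cut(train, bins=edges, labels=False)
new_codes = pd.cut(new, bins=edges, labels=False)

print(np.bincount(train_codes, minlength=5))  # roughly equal counts per bin
print(new_codes)
```

In practice the `edges` array would be saved alongside the model (for example with `np.save`) so the same bin definitions are available wherever predictions are made.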