Binning (discretisation) converts a continuous feature into ordered categories. It is useful when a model cannot handle real-valued inputs directly or when you wish to build features such as an “income decile”.
Equal-width vs. equal-frequency #
Let \(x_1, \dots, x_n\) be a feature. A binning rule partitions the range into intervals \(I_k\) and replaces each \(x_i\) with the label of the interval that contains it.
- Equal-width binning divides the range into intervals of the same length (`pandas.cut`).
- Equal-frequency (quantile) binning divides the sorted data so that each bin contains approximately the same number of observations (`pandas.qcut`).
Equal-frequency bins are more robust to heavy tails, whereas equal-width bins preserve the notion of distance.
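The difference is easy to verify numerically before plotting anything. The sketch below (variable names are illustrative) computes both kinds of edges for a small right-skewed sample: equal-width edges come from `np.linspace` over the range, equal-frequency edges from `np.quantile` at evenly spaced percentiles.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.lognormal(mean=0.0, sigma=1.0, size=1000)  # right-skewed data

# Equal-width: four intervals of identical length across the observed range.
width_edges = np.linspace(sample.min(), sample.max(), 5)

# Equal-frequency: edges at the 0th, 25th, 50th, 75th and 100th percentiles.
freq_edges = np.quantile(sample, [0, 0.25, 0.5, 0.75, 1])

# Equal-width intervals all have the same length...
widths = np.diff(width_edges)
assert np.allclose(widths, widths[0])

# ...while equal-frequency bins each hold roughly a quarter of the points.
counts, _ = np.histogram(sample, bins=freq_edges)
print(counts)
```

With skewed data the equal-width edges leave most observations in the first interval, whereas the quantile edges crowd together near the mode so that every bin stays populated.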
Visualising quantile bins #
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(18)
income = rng.lognormal(mean=10, sigma=0.55, size=1_000)

# Six edges delimit five equal-frequency bins.
quantile_edges = np.quantile(income, np.linspace(0, 1, 6))

plt.hist(income, bins=40, color="steelblue", alpha=0.85)
for edge in quantile_edges[1:-1]:
    plt.axvline(edge, color="darkorange", linestyle="--", alpha=0.8)
plt.title("Quantile bin edges (equal-frequency, 5 bins)")
plt.xlabel("Income")
plt.ylabel("Count")
plt.show()
```

Comparing qcut and cut #
```python
quantile_bins = pd.qcut(income, q=5)   # equal-frequency
width_bins = pd.cut(income, bins=5)    # equal-width

quantile_counts = pd.Series(quantile_bins).value_counts(sort=False)
width_counts = pd.Series(width_bins).value_counts(sort=False)

indices = np.arange(len(quantile_counts))
fig, ax = plt.subplots(figsize=(7, 4))
ax.bar(indices - 0.2, quantile_counts.values, width=0.4,
       color="seagreen", alpha=0.8, label="qcut (equal-frequency)")
ax.bar(indices + 0.2, width_counts.values, width=0.4,
       color="firebrick", alpha=0.6, label="cut (equal-width)")
ax.set_xticks(indices)
ax.set_xticklabels([f"Bin {i}" for i in indices])
ax.set_ylabel("Number of samples")
ax.legend()
ax.set_title("Distribution of samples per bin")
plt.tight_layout()
plt.show()
```

Equal-frequency binning produces almost identical counts for each bin, while equal-width binning assigns many observations to the dense centre and few to the extremes.
Practical tips #
- Clip extreme outliers before binning; even a single extreme value can stretch the range and break equal-width bins.
- Store the edges produced during training and reuse them later to ensure identical bin definitions.
- Tree-based models rarely need explicit binning. For linear models, however, binning can capture non-linear effects while keeping the feature space compact.
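The second tip can be sketched as follows. Passing `retbins=True` makes `pandas.qcut` return the edges it computed, which can then be applied to unseen data with `pandas.cut`; the train/test split here is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(18)
train = rng.lognormal(mean=10, sigma=0.55, size=1_000)
test = rng.lognormal(mean=10, sigma=0.55, size=200)

# Fit: compute quantile edges on the training data only.
train_binned, edges = pd.qcut(train, q=5, retbins=True, labels=False)

# Apply: clip unseen values into the training range, then reuse the stored
# edges so both splits share identical bin definitions.
clipped = np.clip(test, edges[0], edges[-1])
test_binned = pd.cut(clipped, bins=edges, labels=False, include_lowest=True)

print(pd.Series(test_binned).value_counts(sort=False))
```

Clipping before `cut` doubles as the first tip: without it, a test value outside the training range would fall into no bin and come back as `NaN`.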