2.1.3 Robust regression
Summary #
- Ordinary least squares (OLS) reacts strongly to outliers because squared residuals explode, so a single erroneous measurement can distort the entire fit.
- The Huber loss keeps squared loss for small residuals but switches to a linear penalty for large ones, reducing the influence of extreme points.
- Tuning the threshold \(\delta\) (epsilon in scikit-learn) and the optional L2 penalty \(\alpha\) balances robustness against variance.
- Combining scaling with cross-validation yields stable models on real-world data sets that often mix nominal points and anomalies.
Intuition #
OLS minimizes the sum of squared residuals, so a point ten units from the fit contributes a hundred times more loss than a point one unit away; the optimizer bends the whole line toward that one point. The Huber loss behaves like squared loss near the fit and like absolute loss far from it, so extreme points push on the fit with bounded force. The threshold \(\delta\) marks where "near" ends: residuals within \(\delta\) are treated as ordinary noise to be averaged, while residuals beyond it have their influence capped.
Detailed Explanation #
Mathematical formulation #
Let the residual be \(r = y - \hat{y}\). For a chosen threshold \(\delta > 0\), the Huber loss is
$$ \ell_\delta(r) = \begin{cases} \dfrac{1}{2} r^2, & |r| \le \delta, \\ \delta \bigl(|r| - \dfrac{1}{2}\delta\bigr), & |r| > \delta. \end{cases} $$

Small residuals are squared exactly as in OLS, but large residuals grow only linearly. The influence function (the derivative) therefore saturates:
$$ \psi_\delta(r) = \begin{cases} r, & |r| \le \delta, \\ \delta\,\mathrm{sign}(r), & |r| > \delta. \end{cases} $$

In scikit-learn, the threshold corresponds to the parameter epsilon. Adding an L2 penalty \(\alpha \lVert \boldsymbol\beta \rVert_2^2\) further stabilizes the coefficients when features correlate.
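The two piecewise definitions are easy to check numerically. A minimal sketch (the helper names `huber_loss` and `huber_influence` are ours, and `delta = 1.0` is an illustrative choice):

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic for |r| <= delta, linear beyond."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

def huber_influence(r, delta=1.0):
    """Derivative of the Huber loss; saturates at +/- delta."""
    return np.clip(r, -delta, delta)

r = np.array([-10.0, -0.5, 0.0, 0.5, 10.0])
# Loss values: 9.5, 0.125, 0.0, 0.125, 9.5 — large residuals grow linearly.
print(huber_loss(r, delta=1.0))
# Influence values: -1.0, -0.5, 0.0, 0.5, 1.0 — bounded at +/- delta.
print(huber_influence(r, delta=1.0))
```

A residual of 10 costs 9.5 rather than the 50 that squared loss would charge, which is exactly the bounded-influence behavior described above.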
Experiments with Python #
We visualize the loss shapes and compare OLS, Ridge, and Huber on a small synthetic data set that contains a single extreme outlier.
Huber loss versus squared and absolute losses #
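The three loss shapes can be compared on a grid of residuals. A minimal numerical sketch (with the illustrative choice `delta = 1.0`; the grid could equally be passed to matplotlib for plotting):

```python
import numpy as np

def huber(r, delta=1.0):
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r**2, delta * (np.abs(r) - 0.5 * delta))

r = np.linspace(-4.0, 4.0, 401)
squared = 0.5 * r**2
absolute = np.abs(r)
hub = huber(r, delta=1.0)

# Inside [-delta, delta] the Huber loss coincides with the squared loss;
# outside, it grows linearly, parallel to the absolute loss.
inside = np.abs(r) <= 1.0
print(np.allclose(hub[inside], squared[inside]))  # True
print(hub[-1], absolute[-1])  # 3.5 4.0 — linear growth at r = 4
```

At r = 4 the squared loss would charge 8.0, while Huber charges only 3.5: the quadratic zone ends at \(\delta\) and the tails are linear.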
A toy data set with an outlier #
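A synthetic set with one corrupted point is enough to show the effect. A sketch, assuming an illustrative true model \(y = 2x + 1\) with mild Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear trend y = 2x + 1 with mild Gaussian noise; slope, intercept,
# and noise level are illustrative choices, not from a real data set.
n = 30
X = np.linspace(0.0, 10.0, n).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=n)

# Corrupt a single observation to create one extreme outlier.
y[5] += 40.0
print(X.shape, y.shape)  # (30, 1) (30,)
```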
Comparing OLS, Ridge, and Huber #
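One way to sketch the comparison (the synthetic data is regenerated here so the block stands alone; `epsilon=1.35` is scikit-learn's default threshold and `alpha` values are illustrative):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge

# Rebuild the toy data: y = 2x + 1 plus noise, with one extreme outlier.
rng = np.random.default_rng(0)
n = 30
X = np.linspace(0.0, 10.0, n).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.5, size=n)
y[5] += 40.0

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Huber": HuberRegressor(epsilon=1.35, alpha=1e-4),
}
for name, model in models.items():
    model.fit(X, y)
    print(f"{name:>5}: slope={model.coef_[0]:.3f}, intercept={model.intercept_:.3f}")

# HuberRegressor should recover a slope close to the true value 2.0,
# while OLS and Ridge are pulled away by the single outlier.
```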
Reading the results #
- OLS (red) is heavily pulled by the outlier.
- Ridge (orange) is slightly more stable thanks to the L2 penalty but still deviates.
- Huber (green) limits the impact of the outlier and follows the main trend better.
References #
- Huber, P. J. (1964). Robust Estimation of a Location Parameter. The Annals of Mathematical Statistics, 35(1), 73–101.
- Hampel, F. R. et al. (1986). Robust Statistics: The Approach Based on Influence Functions. Wiley.
- Huber, P. J., & Ronchetti, E. M. (2009). Robust Statistics (2nd ed.). Wiley.