Least squares finds the coefficients of a function that best fits pairs of observations $(x_i, y_i)$ by minimizing the sum of squared residuals. We focus on the simplest case, a straight line $y = wx + b$, and walk through the intuition and a practical implementation.
Math is rendered with KaTeX: $\hat y$ denotes the model prediction and $\epsilon$ denotes the noise term.
Goal #
- Learn the line $\hat y = wx + b$ that best fits the data.
- “Best” means minimizing the sum of squared errors (SSE): $\displaystyle L(w,b) = \sum_{i=1}^n (y_i - (w x_i + b))^2$
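To make the objective concrete, here is a minimal sketch of the loss as a Python function; the toy arrays and parameter values below are only illustrative and are not the dataset used later.
import numpy as np

def sse(w, b, x, y):
    """Sum of squared errors L(w, b) for the line y_hat = w * x + b."""
    return np.sum((y - (w * x + b)) ** 2)

# Parameters closer to the data-generating line give a smaller loss
x_toy = np.array([0.0, 1.0, 2.0])
y_toy = np.array([0.5, 1.3, 2.1])   # exactly 0.8 * x + 0.5
print(sse(0.8, 0.5, x_toy, y_toy))  # 0.0
print(sse(2.0, 0.0, x_toy, y_toy))  # 4.35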
Create a simple dataset #
We generate a noisy straight line and fix the random seed for reproducibility.
import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib # optional for Japanese labels
rng = np.random.RandomState(42)
n_samples = 200
# True line (slope 0.8, intercept 0.5) with noise
X = np.linspace(-10, 10, n_samples)
epsilon = rng.normal(loc=0.0, scale=1.0, size=n_samples)
y = 0.8 * X + 0.5 + epsilon
# Reshape to 2D for scikit-learn: (n_samples, 1)
X_2d = X.reshape(-1, 1)
plt.figure(figsize=(10, 5))
plt.scatter(X, y, marker="x", label="observations", c="orange")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.legend()
plt.show()
In scikit-learn, features are always a 2D array: rows are samples and columns are features. Use X.reshape(-1, 1) for a single feature.
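A quick check of the shapes defined above makes the difference concrete:
print(X.shape)     # (200,)   -- 1D array, not accepted as a feature matrix
print(X_2d.shape)  # (200, 1) -- 200 samples, 1 feature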
Inspect the noise #
Let’s check the distribution of epsilon.
plt.figure(figsize=(10, 5))
plt.hist(epsilon, bins=30)
plt.xlabel("$\\epsilon$")
plt.ylabel("count")
plt.show()
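As a quick numeric check, the sample mean and standard deviation should be close to the loc and scale we passed to rng.normal:
print("mean:", epsilon.mean())  # close to 0.0
print("std:", epsilon.std())    # close to 1.0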
Linear regression (least squares) with scikit-learn #
We use sklearn.linear_model.LinearRegression.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
model = LinearRegression() # fit_intercept=True by default
model.fit(X_2d, y)
print("slope w:", model.coef_[0])
print("intercept b:", model.intercept_)
y_pred = model.predict(X_2d)
# Metrics
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)
print("MSE:", mse)
print("R^2:", r2)
# Plot
plt.figure(figsize=(10, 5))
plt.scatter(X, y, marker="x", label="observations", c="orange")
plt.plot(X, y_pred, label="fitted line", c="C0")
plt.xlabel("$x$")
plt.ylabel("$y$")
plt.legend()
plt.show()
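As a sanity check, the reported MSE is just the SSE from the Goal section divided by n. A minimal sketch using the arrays defined above:
sse_fit = np.sum((y - y_pred) ** 2)          # loss L(w, b) at the fitted parameters
print(np.isclose(sse_fit / n_samples, mse))  # True: MSE = SSE / n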
Feature scaling is not required to solve ordinary least squares, but it helps with conditioning in multivariate problems and matters once you add regularization.
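To see this in the univariate case, here is a minimal sketch showing that standardizing the feature changes the coefficients but not the predictions; it reuses X_2d, y, and y_pred from above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scaled_model.fit(X_2d, y)
# Identical predictions to the unscaled fit, up to floating-point error
print(np.allclose(scaled_model.predict(X_2d), y_pred))  # True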
Closed-form solution (reference) #
For $\hat y = wx + b$, the minimizers are
$\displaystyle w = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)}$
$\displaystyle b = \bar y - w\,\bar x$
Verify with NumPy:
x_mean, y_mean = X.mean(), y.mean()
w_hat = ((X - x_mean) * (y - y_mean)).sum() / ((X - x_mean) ** 2).sum()
b_hat = y_mean - w_hat * x_mean
print(w_hat, b_hat)
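As a cross-check, NumPy's built-in least-squares solvers should give the same values; a small sketch using np.polyfit and np.linalg.lstsq:
w_poly, b_poly = np.polyfit(X, y, deg=1)        # degree-1 polynomial fit: [slope, intercept]
A = np.column_stack([X, np.ones_like(X)])       # design matrix [x, 1]
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)  # solves min ||A @ [w, b] - y||^2
print(w_poly, b_poly)   # matches w_hat, b_hat up to floating-point error
print(coeffs)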
Common pitfalls #
- Array shapes: X should be (n_samples, n_features). Even for a single feature, use reshape(-1, 1).
- Target shape: y can be (n_samples,). (n_samples, 1) also works, but mind broadcasting (see the sketch after this list).
- Intercept: fit_intercept=True by default. If you have centered both the features and the target, False is fine.
- Reproducibility: use a fixed seed via np.random.RandomState or np.random.default_rng.
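To illustrate the broadcasting pitfall, here is a minimal sketch: if the target is stored as a column vector of shape (n, 1) while the predictions have shape (n,), an elementwise subtraction silently broadcasts to an (n, n) matrix instead of raising an error.
y_col = y.reshape(-1, 1)    # target as a column vector, shape (200, 1)
residuals = y_col - y_pred  # broadcasts to shape (200, 200) -- usually a bug
print(residuals.shape)      # (200, 200)
print((y - y_pred).shape)   # (200,) -- what we actually want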
Going further (multivariate) #
For multiple features, keep X as (n_samples, n_features). Pipelines let you combine preprocessing and the estimator.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
X_multi = rng.normal(size=(n_samples, 2))
y_multi = 1.0 * X_multi[:, 0] - 2.0 * X_multi[:, 1] + 0.3 + rng.normal(size=n_samples)
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_multi, y_multi)
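To inspect the fitted model inside the pipeline, here is a sketch; note that the coefficients live in the standardized feature space, and the step name "linearregression" is the lowercased class name that make_pipeline assigns.
lr = pipe.named_steps["linearregression"]
print("coefficients (scaled space):", lr.coef_)
print("intercept:", lr.intercept_)
print("R^2:", pipe.score(X_multi, y_multi))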
Code blocks are for learning purposes; figures are pre-rendered for the site.