Principal Component Analysis (PCA) is the workhorse of dimensionality reduction. By rotating the coordinate system so that the axes follow the directions of maximum variance, PCA compresses the data while keeping as much of the original information as possible.
1. Why PCA? #
- High-dimensional data is hard to interpret and to visualise; PCA finds orthogonal directions that summarise the bulk of the variance.
- The method is unsupervised: it does not use labels, only the covariance structure of the data.
- Once we project onto the leading components we can visualise, denoise, or feed the compressed features to downstream models.
2. Mathematics #
Given a zero-centred data matrix $X \in \mathbb{R}^{n \times d}$ with $n$ samples and $d$ features:
- Covariance matrix $$ \Sigma = \frac{1}{n} X^\top X $$
- Eigen-decomposition $$ \Sigma v_j = \lambda_j v_j $$ where the $v_j$ are eigenvectors (principal axes) and the $\lambda_j$ eigenvalues (explained variances).
- Projection $$ Z = X V_k $$ using the top $k$ eigenvectors; a NumPy sketch of these three steps follows below.
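To make the recipe concrete, here is a minimal NumPy sketch of the three steps above (centre, eigen-decompose the covariance, project). The toy data and variable names are illustrative only; scikit-learn's implementation, shown later, is what you would use in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # toy data: n = 200 samples, d = 5 features

# 1) centre the data (PCA assumes zero mean)
Xc = X - X.mean(axis=0)

# 2) covariance matrix and its eigen-decomposition
Sigma = Xc.T @ Xc / len(Xc)                # (d, d)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]          # sort by decreasing explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 3) project onto the top-k principal axes
k = 2
V_k = eigvecs[:, :k]                       # (d, k) matrix of leading eigenvectors
Z = Xc @ V_k                               # (n, k) compressed representation

print("explained variance ratio:", eigvals[:k] / eigvals.sum())
```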
3. Create a sample dataset #
```python
import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib  # Japanese font support for matplotlib (optional here)
from sklearn.datasets import make_blobs

# three Gaussian blobs in three dimensions
X, y = make_blobs(n_samples=600, n_features=3, random_state=117117)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_zlabel("$x_3$")
plt.show()
```
4. Run PCA with scikit-learn #
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardise, then project onto the two leading principal components
pca = PCA(n_components=2, whiten=True)
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))

plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("2-D embedding via PCA")
plt.show()
```
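To see how much variance the two components retain, you can inspect the fitted object's explained_variance_ratio_ and components_ attributes; the exact numbers depend on the blobs generated above.

```python
# fraction of the (standardised) variance captured by each principal component
print("explained variance ratio:", pca.explained_variance_ratio_)
print("cumulative:", pca.explained_variance_ratio_.cumsum())

# principal axes expressed as weights on the original (standardised) features
print("components:")
print(pca.components_)
```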
5. Scaling matters #
```python
from sklearn.preprocessing import StandardScaler

# blobs again, but with deliberately mismatched feature scales
X, y = make_blobs(
    n_samples=200,
    n_features=3,
    random_state=11711,
    centers=3,
    cluster_std=2.0,
)
X[:, 1] *= 1000   # blow up the second feature
X[:, 2] *= 0.01   # shrink the third feature

X_ss = StandardScaler().fit_transform(X)

# PCA on the raw features vs. on the standardised features
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
pca_ss = PCA(n_components=2).fit(X_ss)
X_pca_ss = pca_ss.transform(X_ss)

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title("Unscaled features")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, marker="x", alpha=0.6)
plt.subplot(122)
plt.title("Scaled features")
plt.scatter(X_pca_ss[:, 0], X_pca_ss[:, 1], c=y, marker="x", alpha=0.6)
plt.show()
```
PCA is dominated by features with large variance; scaling (or whitening) is essential when feature units differ.
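A quick numerical check, reusing pca and pca_ss from above, is to compare the loadings of the first principal axis: in the unscaled fit it should be dominated almost entirely by the feature that was multiplied by 1000, whereas the scaled fit weights the features far more evenly.

```python
# weights of the first principal component on the original features
print("unscaled PC1 loadings:", pca.components_[0])     # dominated by the blown-up feature
print("scaled   PC1 loadings:", pca_ss.components_[0])  # weights of comparable magnitude

print("unscaled explained variance ratio:", pca.explained_variance_ratio_)
print("scaled   explained variance ratio:", pca_ss.explained_variance_ratio_)
```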
6. Practical considerations #
- Explained variance ratio: $\lambda_j / \sum_i \lambda_i$ helps decide how many principal components to keep; a cumulative 80–90% is a common target (see the scree-plot sketch after this list).
- Computation: PCA is implemented via SVD under the hood; use `svd_solver='randomized'` for large datasets.
- Kernel PCA: when linear PCA is not enough, switch to kernels (see the dedicated section) or try UMAP/t-SNE for local structure.
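As a sketch of the first two points, the snippet below fits PCA with svd_solver='randomized' on a hypothetical higher-dimensional blob dataset (not the 3-feature data used earlier) and plots the cumulative explained variance ratio so you can read off the smallest $k$ that crosses a chosen threshold; the 90% cut-off is an arbitrary example.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# a wider toy dataset so the scree plot is non-trivial
X_hi, _ = make_blobs(n_samples=1000, n_features=50, centers=5, random_state=0)
X_hi = StandardScaler().fit_transform(X_hi)

# randomized SVD is faster than the exact solver on large matrices
pca_full = PCA(n_components=20, svd_solver="randomized", random_state=0).fit(X_hi)

cumvar = np.cumsum(pca_full.explained_variance_ratio_)
k = int(np.searchsorted(cumvar, 0.90)) + 1   # smallest k reaching 90% of the variance
print(f"keep {k} components to explain >= 90% of the variance")

plt.plot(range(1, len(cumvar) + 1), cumvar, marker="o")
plt.axhline(0.90, linestyle="--")
plt.xlabel("number of components $k$")
plt.ylabel("cumulative explained variance ratio")
plt.show()
```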
Summary #
- PCA rotates the coordinate system to capture maximum variance with fewer axes.
- Always centre the data; scale it when magnitudes differ.
- Inspect scree plots to choose $k$; a handful of components often capture most of the variance.
- SVD and kernel PCA are natural extensions when you need more flexibility.