PCA

Last updated 2026-02-16 Read time 3 min
Summary
  • PCA finds orthogonal directions of maximum variance and projects data onto leading components.
  • Explained-variance ratios provide a quantitative way to choose the number of components.
  • Feature scaling strongly affects PCA, so standardizing features beforehand is usually essential.

Intuition #

PCA rotates the coordinate system toward the directions that capture the most variation. Keeping only the strongest axes compresses the data while retaining its dominant structure.
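
As a minimal sketch of this idea (the toy data and variable names here are illustrative, not from the original), the first principal axis of a strongly correlated 2-D cloud points along the direction of correlation and carries nearly all of the variance:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# strongly correlated 2-D cloud: most variation lies along the y = x direction
x = rng.normal(size=500)
data = np.column_stack([x, x + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(data)
print(pca.components_[0])             # first principal axis, roughly [0.71, 0.71] up to sign
print(pca.explained_variance_ratio_)  # almost all variance falls on the first axis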

Detailed Explanation #

1. Why PCA? #

  • High-dimensional data is hard to interpret and to visualise; PCA finds orthogonal directions that summarise the bulk of the variance.
  • The method is unsupervised: it does not use labels, only the covariance structure of the data.
  • Once we project onto the leading components we can visualise, denoise, or feed the compressed features to downstream models (a brief sketch follows this list).
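
A brief sketch of that last point (the digits dataset and model choice here are illustrative, not from the original): PCA-compressed features can be dropped straight into a scikit-learn pipeline feeding a downstream classifier.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_digits, y_digits = load_digits(return_X_y=True)

# standardize -> keep the 30 leading components -> classify on the compressed features
clf = make_pipeline(StandardScaler(), PCA(n_components=30), LogisticRegression(max_iter=1000))
print(cross_val_score(clf, X_digits, y_digits, cv=5).mean())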

2. Mathematics #

Given a zero-centred data matrix $X \in \mathbb{R}^{n \times d}$:

  1. Covariance matrix $$ \Sigma = \frac{1}{n} X^\top X $$
  2. Eigen-decomposition $$ \Sigma v_j = \lambda_j v_j $$ where $v_j$ are the eigenvectors (principal axes) and $\lambda_j$ the corresponding eigenvalues (explained variances).
  3. Projection $$ Z = X V_k $$ where $V_k$ collects the top $k$ eigenvectors as columns.
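
A small NumPy sketch of these three steps (the array X_demo and its contents are illustrative), which can be checked against scikit-learn's PCA:

import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))  # correlated 3-D data
X_demo = X_demo - X_demo.mean(axis=0)                         # zero-centre

Sigma = X_demo.T @ X_demo / len(X_demo)   # covariance matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: ascending eigenvalues of a symmetric matrix
order = np.argsort(eigvals)[::-1]         # reorder by explained variance, largest first
V_k = eigvecs[:, order[:2]]               # top-k principal axes as columns (k = 2)

Z = X_demo @ V_k                          # projection onto the leading components
print(eigvals[order] / eigvals.sum())     # explained-variance ratios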

3. Create a sample dataset #

import numpy as np
import matplotlib.pyplot as plt
import japanize_matplotlib
from sklearn.datasets import make_blobs

# three Gaussian blobs in 3-D
X, y = make_blobs(n_samples=600, n_features=3, random_state=117117)

# 3-D scatter coloured by blob label
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y)
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_zlabel("$x_3$")
plt.show()

3D blobs


4. Run PCA with scikit-learn #

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# standardize, then project onto the two leading (whitened) components
pca = PCA(n_components=2, whiten=True)
X_pca = pca.fit_transform(StandardScaler().fit_transform(X))

plt.figure(figsize=(8, 8))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("2-D embedding via PCA")
plt.show()

PCA projection
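
If useful, the fitted model above can be inspected directly; this short continuation (not part of the original code) shows how much variance each retained component captures and how each axis combines the standardized input features:

# fraction of total variance captured by each retained component
print(pca.explained_variance_ratio_)

# each row is a principal axis expressed in the standardized feature space
print(pca.components_)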


5. Scaling matters #

from sklearn.preprocessing import StandardScaler

X, y = make_blobs(
    n_samples=200,
    n_features=3,
    random_state=11711,
    centers=3,
    cluster_std=2.0,
)
# artificially exaggerate the scale differences between features
X[:, 1] *= 1000
X[:, 2] *= 0.01

X_ss = StandardScaler().fit_transform(X)

# PCA on the raw (unscaled) features
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

# PCA on the standardized features
pca_ss = PCA(n_components=2).fit(X_ss)
X_pca_ss = pca_ss.transform(X_ss)

plt.figure(figsize=(10, 5))
plt.subplot(121)
plt.title("Unscaled features")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, marker="x", alpha=0.6)
plt.subplot(122)
plt.title("Scaled features")
plt.scatter(X_pca_ss[:, 0], X_pca_ss[:, 1], c=y, marker="x", alpha=0.6)
plt.show()

Scaling effect

PCA is dominated by features with large variance; scaling (or whitening) is essential when feature units differ.
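
To quantify this dominance, one can compare the explained-variance ratios of the two fits above (a short continuation, not part of the original code):

# unscaled: the feature multiplied by 1000 carries almost all the variance, so the first ratio is close to 1.0
print(pca.explained_variance_ratio_)

# standardized: variance is spread across features, so the blob structure drives the axes
print(pca_ss.explained_variance_ratio_)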


6. Practical considerations #

  • Explained variance ratio: $\lambda_j / \sum_i \lambda_i$ helps decide how many PCs to keep (a cumulative total of 80–90% is a common target); a short sketch follows this list.
  • Computation: PCA is implemented via SVD under the hood; use svd_solver='randomized' for large datasets.
  • Kernel PCA: when linear PCA is not enough, switch to kernels (see the dedicated section) or try UMAP/t-SNE for local structure.
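
A short sketch of the first two points, picking the number of components from a cumulative explained-variance target and switching to the randomized solver (the digits dataset here is only for illustration):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_digits, _ = load_digits(return_X_y=True)
X_std = StandardScaler().fit_transform(X_digits)

# a float n_components keeps just enough PCs to reach that explained-variance fraction
pca_90 = PCA(n_components=0.90, svd_solver="full").fit(X_std)
print(pca_90.n_components_, pca_90.explained_variance_ratio_.sum())

# for large matrices, randomized SVD with a fixed component count is much faster
pca_fast = PCA(n_components=30, svd_solver="randomized", random_state=0).fit(X_std)
print(np.cumsum(pca_fast.explained_variance_ratio_)[-1])  # cumulative variance kept by 30 PCs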