2.2.4 Linear Discriminant Analysis (LDA)
- LDA is supervised dimensionality reduction that maximizes between-class variance while minimizing within-class variance.
- Because labels are used, LDA is often a strong preprocessing step for classification tasks.
- Performance depends on class distribution and covariance assumptions.
- Prerequisite: Principal Component Analysis (PCA). Understanding that concept first will make learning smoother.
Intuition #
Unlike PCA, LDA optimizes for class separability, not just overall spread. It searches for projection directions where classes are compact and well separated.
Detailed Explanation #
1. PCA vs LDA #
- PCA: unsupervised, keeps the directions of largest variance irrespective of class labels.
- LDA: supervised, searches for directions that maximise the ratio of between-class variance to within-class variance.
2. Formulation #
With labelled classes \(C_1, \dots, C_k\):
- Within-class scatter $$ S_W = \sum_{j=1}^k \sum_{x_i \in C_j} (x_i - \mu_j)(x_i - \mu_j)^\top $$
- Between-class scatter $$ S_B = \sum_{j=1}^k n_j (\mu_j - \mu)(\mu_j - \mu)^\top $$
- Optimisation $$ J(w) = \frac{w^\top S_B w}{w^\top S_W w} $$ The eigenvectors of \(S_W^{-1} S_B\) give the discriminant directions. At most \(k-1\) components carry information.
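The scatter matrices and eigenvector computation above can be checked numerically. A minimal NumPy sketch (the helper `lda_directions` and the toy data are illustrative, not from the original):

```python
import numpy as np

def lda_directions(X, y):
    """Discriminant directions: eigenvectors of S_W^{-1} S_B, sorted by eigenvalue."""
    classes = np.unique(y)
    mu = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))
    S_B = np.zeros((d, d))
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        S_W += (Xc - mu_c).T @ (Xc - mu_c)        # within-class scatter
        diff = (mu_c - mu).reshape(-1, 1)
        S_B += len(Xc) * (diff @ diff.T)          # between-class scatter
    # eigen-decomposition of S_W^{-1} S_B via a linear solve
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order], eigvals.real[order]

# three Gaussian classes in 3-D: only k - 1 = 2 eigenvalues are nonzero
rng = np.random.default_rng(0)
means = np.array([[0, 0, 0], [4, 0, 0], [0, 4, 0]])
X = np.vstack([rng.normal(loc=m, size=(30, 3)) for m in means])
y = np.repeat([0, 1, 2], 30)
W, vals = lda_directions(X, y)
print(vals)  # the third eigenvalue is numerically zero
```

Because \(S_B\) is built from \(k\) class means, its rank is at most \(k-1\), which is why the remaining eigenvalues vanish.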
3. Build a dataset #
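A minimal way to build a labelled toy dataset is scikit-learn's `make_blobs`; the specific shape (3 classes in 5 dimensions) is an assumption for illustration:

```python
from sklearn.datasets import make_blobs

# 3 Gaussian classes in 5 dimensions; y holds the class labels LDA needs
X, y = make_blobs(n_samples=300, n_features=5, centers=3,
                  cluster_std=2.0, random_state=42)
print(X.shape, y.shape)  # (300, 5) (300,)
```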
4. Apply LDA #
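Fitting LDA on such a dataset is one call; a sketch (dataset regenerated here so the snippet is self-contained):

```python
from sklearn.datasets import make_blobs
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_blobs(n_samples=300, n_features=5, centers=3,
                  cluster_std=2.0, random_state=42)

# k = 3 classes, so at most k - 1 = 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)  # labels are required: LDA is supervised
print(X_lda.shape)  # (300, 2)
```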
5. Compare with PCA #
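The comparison can be sketched by projecting the same data with both methods; plotting the two 2-D projections coloured by `y` shows the difference (the dataset choice is an assumption):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_blobs(n_samples=300, n_features=5, centers=3,
                  cluster_std=2.0, random_state=42)

X_pca = PCA(n_components=2).fit_transform(X)                            # ignores y
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # uses y

print(X_pca.shape, X_lda.shape)  # (300, 2) (300, 2)
```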
PCA mixes the classes because it ignores labels; LDA keeps them separated.
6. Practical notes #
- The number of useful discriminants is at most `n_classes - 1`.
- Standardise features before fitting, especially when different units are mixed.
- LDA assumes roughly equal covariance within classes; when that is violated, consider QDA or regularised LDA.
FAQ #
What is Linear Discriminant Analysis (LDA)? #
Linear Discriminant Analysis (LDA) is a supervised dimensionality reduction and classification technique. It finds linear combinations of features that best separate two or more classes by maximizing between-class scatter while minimizing within-class scatter. Unlike PCA, LDA uses class labels, making it ideal as a preprocessing step before classification.
What is the LDA formula? #
LDA optimizes the Fisher criterion:
$$ J(w) = \frac{w^\top S_B w}{w^\top S_W w} $$
where \(S_W\) is the within-class scatter matrix and \(S_B\) is the between-class scatter matrix. The optimal projection directions \(w\) are the eigenvectors of \(S_W^{-1} S_B\). The maximum number of discriminant components is \(k - 1\), where \(k\) is the number of classes.
What is the difference between LDA and PCA? #
| | PCA | LDA |
|---|---|---|
| Supervision | Unsupervised | Supervised (uses labels) |
| Goal | Maximize total variance | Maximize class separation |
| Max components | min(n_features, n_samples−1) | k−1 (k = number of classes) |
| Best for | Compression, denoising | Classification preprocessing |
Use PCA when you have no labels or want general-purpose compression. Use LDA when you want projections that help a downstream classifier separate classes.
When does LDA fail or perform poorly? #
LDA assumes that each class has a similar (homogeneous) covariance matrix. It can fail when:
- Class covariances differ significantly (use Quadratic Discriminant Analysis, QDA, instead).
- The number of features exceeds the number of samples (the within-class scatter matrix becomes singular — use regularised LDA or PCA first).
- Classes are not linearly separable (consider kernel LDA or nonlinear methods).
How do I use LDA in scikit-learn? #
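A typical scikit-learn workflow looks like the following sketch; fit on the training split only, then reuse the learned projection on the test split:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

lda = LinearDiscriminantAnalysis(n_components=2)
X_train_2d = lda.fit_transform(X_train, y_train)  # fit on training data only
X_test_2d = lda.transform(X_test)                 # reuse the fitted projection

# LDA also works directly as a classifier
print("test accuracy:", lda.score(X_test, y_test))
```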
`n_components` controls how many discriminant axes to keep (at most `n_classes - 1`). You can also call `lda.predict(X_test)` directly for classification without a separate classifier.