Basics
Summary
- Model assumptions and when the method is appropriate.
- Objective criteria and how they shape model behavior.
- Implementation and validation choices that lead to stable results.
Intuition
Any machine learning method should be interpreted through its assumptions, the conditions its data must satisfy, and how its parameter choices affect generalization.
Detailed Explanation
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks.
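The definition above can be made concrete with a minimal sketch: fitting a straight line to data by ordinary least squares, in plain Python. The function names and the small synthetic dataset here are illustrative, not part of any particular library.

```python
# Minimal illustration of "methods that learn": fit y = w*x + b by
# ordinary least squares on a small synthetic dataset, then measure
# how well the learned line explains the data.

def fit_line(xs, ys):
    """Return slope w and intercept b minimising squared error."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    return w, mean_y - w * mean_x

def mse(xs, ys, w, b):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Noisy samples of the underlying relationship y = 2x + 1.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0]
w, b = fit_line(xs, ys)
print(f"w={w:.2f}, b={b:.2f}, MSE={mse(xs, ys, w, b):.3f}")
```

The recovered slope and intercept land close to the true values (2 and 1): the program's performance on the task improved by using the data, which is exactly the sense of "learning" in the definition above.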
Algorithm Selection Flowchart
Use this as a starting point to choose the right method based on your data and objective.
```mermaid
flowchart TD
    START["Do you have labels?"]
    START -->|Yes| SUP["Supervised Learning"]
    START -->|No| UNSUP["Unsupervised Learning"]
    SUP --> STYPE{"Target type?"}
    STYPE -->|Continuous| REG["Regression"]
    STYPE -->|Categorical| CLF["Classification"]
    REG --> RLIN{"Strong linear<br>relationship?"}
    RLIN -->|Yes| R1["Linear / Ridge / Lasso"]
    RLIN -->|No| R2{"Dataset size?"}
    R2 -->|Small| R3["SVR / Polynomial"]
    R2 -->|Large| R4["XGBoost / LightGBM"]
    CLF --> CLIN{"Many features?<br>or linearly separable?"}
    CLIN -->|Yes| C1["Logistic Regression / SVM"]
    CLIN -->|No| C2{"Interpretability<br>needed?"}
    C2 -->|Yes| C3["Decision Tree / RuleFit"]
    C2 -->|No| C4["Random Forest / XGBoost"]
    UNSUP --> UTYPE{"Goal?"}
    UTYPE -->|Grouping| CLUST["Clustering"]
    UTYPE -->|Dim. reduction| DIM["Dimensionality Reduction"]
    UTYPE -->|Outlier detection| ANOM["Anomaly Detection"]
    CLUST --> CK{"Number of clusters<br>known?"}
    CK -->|Yes| CK1["k-means / GMM"]
    CK -->|No| CK2["DBSCAN / HDBSCAN"]
    DIM --> DLIN{"Linear sufficient?"}
    DLIN -->|Yes| D1["PCA / SVD"]
    DLIN -->|No| D2["t-SNE / Isomap"]
    ANOM --> A1["Isolation Forest / ADTK"]
    style START fill:#2563eb,color:#fff
    style SUP fill:#1e40af,color:#fff
    style UNSUP fill:#1e40af,color:#fff
    style REG fill:#3b82f6,color:#fff
    style CLF fill:#3b82f6,color:#fff
    style CLUST fill:#3b82f6,color:#fff
    style DIM fill:#3b82f6,color:#fff
    style ANOM fill:#3b82f6,color:#fff
```
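The decision logic in the flowchart can be sketched as a small rule-based helper. The function name, argument names, and branch conditions below are illustrative assumptions made for this sketch, not a canonical API; in practice these questions are judgment calls rather than boolean flags.

```python
# Rule-based sketch of the selection flowchart: given coarse facts about a
# dataset and goal, suggest candidate algorithm families. All names and
# branch conditions here are illustrative.

def suggest_algorithm(has_labels, target_type=None, linear=False,
                      large_dataset=False, interpretable=False,
                      goal=None, n_clusters_known=False):
    if has_labels:
        if target_type == "continuous":                    # Regression branch
            if linear:
                return "Linear / Ridge / Lasso"
            return "XGBoost / LightGBM" if large_dataset else "SVR / Polynomial"
        if linear:                                         # Classification branch
            return "Logistic Regression / SVM"
        return "Decision Tree / RuleFit" if interpretable else "Random Forest / XGBoost"
    if goal == "grouping":                                 # Unsupervised branch
        return "k-means / GMM" if n_clusters_known else "DBSCAN / HDBSCAN"
    if goal == "dim_reduction":
        return "PCA / SVD" if linear else "t-SNE / Isomap"
    return "Isolation Forest / ADTK"                       # Anomaly detection

print(suggest_algorithm(True, target_type="continuous", linear=True))
# prints "Linear / Ridge / Lasso"
```

Treat the output as a starting shortlist, as the flowchart intro says: validate the candidates against your actual data before committing to one.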
Algorithm Quick Reference
| Category | Method | Strength | Best for |
|---|---|---|---|
| Linear Model | Linear Regression | Interpretable | Baseline, linear data |
| Regularised | Ridge / Lasso | Controls overfitting | Many features |
| Linear Classifier | Logistic Regression | Probability output | Binary baseline |
| Margin Classifier | SVM | High-dim friendly | Text, small data |
| Decision Tree | Decision Tree | Human-readable rules | Interpretability |
| Bagging | Random Forest | Stable & accurate | General-purpose |
| Boosting | XGBoost / LightGBM | Top accuracy | Competitions, large data |
| Centroid | k-means | Fast & simple | Spherical clusters |
| Density | DBSCAN | No preset cluster count | Arbitrarily shaped clusters |
| Linear DR | PCA | Variance-maximising | Preprocessing, viz |
| Nonlinear DR | t-SNE | Local structure | High-dim visualisation |
| Anomaly | Isolation Forest | Unsupervised | Outlier screening |
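As a worked example for the "Centroid" row above, here is a compact pure-Python sketch of Lloyd's algorithm for k-means. The naive initialisation and the toy data are illustrative only; production code should use a library implementation with k-means++ initialisation and multiple restarts.

```python
# Compact sketch of Lloyd's algorithm for k-means on 2-D points.
# Naive initialisation (the first k points) keeps the demo deterministic;
# real implementations use k-means++ or random restarts.

def kmeans(points, k, iters=20):
    centroids = list(points[:k])
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: (p[0] - centroids[j][0]) ** 2
                                  + (p[1] - centroids[j][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centroids[i] = (sum(p[0] for p in c) / len(c),
                                sum(p[1] for p in c) / len(c))
    return centroids

# Two well-separated blobs; one centroid should settle near each.
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (5.2, 4.9)]
print(sorted(kmeans(pts, 2)))
```

The alternating assignment and update steps are what make k-means "fast and simple", and the reliance on distances to a mean is why it favours roughly spherical clusters, as the table notes.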