Basics


Summary
  • Model assumptions and when each method is appropriate.
  • Objective criteria and how they influence model behavior.
  • Implementation and validation choices for stable results.

Intuition

Any learning method should be interpreted through its assumptions, the conditions its data must satisfy, and how its parameter choices affect generalization.

Detailed Explanation

> Machine learning (ML) is a field of inquiry devoted to understanding and building methods that "learn", that is, methods that leverage data to improve performance on some set of tasks.
>
> — Wikipedia, "Machine learning"


Algorithm Selection Flowchart

Use this as a starting point to choose the right method based on your data and objective.

```mermaid
flowchart TD
    START["Do you have labels?"]
    START -->|Yes| SUP["Supervised Learning"]
    START -->|No| UNSUP["Unsupervised Learning"]
    SUP --> STYPE{"Target type?"}
    STYPE -->|Continuous| REG["Regression"]
    STYPE -->|Categorical| CLF["Classification"]
    REG --> RLIN{"Strong linear<br>relationship?"}
    RLIN -->|Yes| R1["Linear / Ridge / Lasso"]
    RLIN -->|No| R2{"Dataset size?"}
    R2 -->|Small| R3["SVR / Polynomial"]
    R2 -->|Large| R4["XGBoost / LightGBM"]
    CLF --> CLIN{"Many features?<br>or linearly separable?"}
    CLIN -->|Yes| C1["Logistic Regression / SVM"]
    CLIN -->|No| C2{"Interpretability<br>needed?"}
    C2 -->|Yes| C3["Decision Tree / RuleFit"]
    C2 -->|No| C4["Random Forest / XGBoost"]
    UNSUP --> UTYPE{"Goal?"}
    UTYPE -->|Grouping| CLUST["Clustering"]
    UTYPE -->|Dim. reduction| DIM["Dimensionality Reduction"]
    UTYPE -->|Outlier detection| ANOM["Anomaly Detection"]
    CLUST --> CK{"Number of clusters<br>known?"}
    CK -->|Yes| CK1["k-means / GMM"]
    CK -->|No| CK2["DBSCAN / HDBSCAN"]
    DIM --> DLIN{"Linear sufficient?"}
    DLIN -->|Yes| D1["PCA / SVD"]
    DLIN -->|No| D2["t-SNE / Isomap"]
    ANOM --> A1["Isolation Forest / ADTK"]
    style START fill:#2563eb,color:#fff
    style SUP fill:#1e40af,color:#fff
    style UNSUP fill:#1e40af,color:#fff
    style REG fill:#3b82f6,color:#fff
    style CLF fill:#3b82f6,color:#fff
    style CLUST fill:#3b82f6,color:#fff
    style DIM fill:#3b82f6,color:#fff
    style ANOM fill:#3b82f6,color:#fff
```
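The flowchart's branching logic can be sketched as a plain Python function. The function name, parameters, and string values below are illustrative choices for this sketch, not part of any library; each parameter answers one of the flowchart's questions.

```python
def suggest_methods(has_labels, target_type=None, linear=False,
                    large_dataset=False, interpretable=False,
                    goal=None, k_known=False):
    """Map data properties to candidate methods, mirroring the flowchart.

    Illustrative sketch: `linear` answers "strong linear relationship?"
    for regression and "many features / linearly separable?" for
    classification and dimensionality reduction.
    """
    if has_labels:  # supervised branch
        if target_type == "continuous":  # regression
            if linear:
                return ["Linear", "Ridge", "Lasso"]
            return ["XGBoost", "LightGBM"] if large_dataset else ["SVR", "Polynomial"]
        # categorical target -> classification
        if linear:
            return ["Logistic Regression", "SVM"]
        if interpretable:
            return ["Decision Tree", "RuleFit"]
        return ["Random Forest", "XGBoost"]
    # unsupervised branch
    if goal == "grouping":
        return ["k-means", "GMM"] if k_known else ["DBSCAN", "HDBSCAN"]
    if goal == "dim_reduction":
        return ["PCA", "SVD"] if linear else ["t-SNE", "Isomap"]
    return ["Isolation Forest", "ADTK"]  # outlier detection

# Labelled data, categorical target, interpretability required:
print(suggest_methods(True, "categorical", interpretable=True))
# -> ['Decision Tree', 'RuleFit']
```

Treat the output as a shortlist of candidates to benchmark, not a final answer; real selection should be confirmed by cross-validated comparison on your data.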

Algorithm Quick Reference

| Category | Method | Strength | Best for |
|---|---|---|---|
| Linear Regression | Linear Regression | Interpretable | Baseline, linear data |
| Regularised | Ridge / Lasso | Controls overfitting | Many features |
| Linear Classifier | Logistic Regression | Probability output | Binary baseline |
| Margin Classifier | SVM | High-dim friendly | Text, small data |
| Decision Tree | Decision Tree | Human-readable rules | Interpretability |
| Bagging | Random Forest | Stable & accurate | General-purpose |
| Boosting | XGBoost / LightGBM | Top accuracy | Competitions, large data |
| Centroid | k-means | Fast & simple | Spherical clusters |
| Density | DBSCAN | Shape-free | Arbitrary clusters |
| Linear DR | PCA | Variance-maximising | Preprocessing, viz |
| Nonlinear DR | t-SNE | Local structure | High-dim visualisation |
| Anomaly | Isolation Forest | Unsupervised | Outlier screening |
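To make one table entry concrete, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on 2-D points. It is illustrative only: it uses random initialisation rather than k-means++, and in practice you would use a library implementation such as scikit-learn's.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: alternate assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # naive init: k distinct data points
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to its cluster's mean
        # (an empty cluster keeps its old centroid).
        for i, cl in enumerate(clusters):
            if cl:
                centroids[i] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    return centroids

# Two well-separated blobs; k-means recovers one centroid per blob.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(pts, 2)))
```

The "Spherical clusters" caveat in the table is visible in the code: the squared-Euclidean assignment step carves space into convex regions around the centroids, which is why DBSCAN is preferred for arbitrarily shaped clusters.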