Evaluation

Summary
  • Understand the fundamentals of each metric, what it evaluates, and how to interpret the results.
  • Compute and visualise each metric with Python 3.13 code examples, covering key steps and practical checkpoints.
  • Combine charts and complementary metrics for effective model comparison and threshold tuning.

Metrics

Quick Reference

Classification Metrics

| Metric | Imbalance-safe | Evaluates probability | Threshold-free | Multi-class | Primary use |
|---|---|---|---|---|---|
| Accuracy | – | – | – | ✓ | Balanced-data overview |
| Balanced Accuracy | ✓ | – | – | ✓ | Imbalanced accuracy |
| Precision / Recall / F1 | ✓ | – | – | ✓ | Cost-asymmetric tasks |
| ROC-AUC | – | ✓ | ✓ | ✓ | Threshold-free comparison |
| Average Precision | ✓ | ✓ | ✓ | – | Rare-positive tasks |
| Log Loss | – | ✓ | ✓ | ✓ | Probability calibration |
| Brier Score | – | ✓ | ✓ | ✓ | Calibration (MSE-based) |
| MCC | ✓ | – | – | ✓ | Uses all confusion cells |
| Cohen’s Kappa | ✓ | – | – | ✓ | Annotator agreement |
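
All nine metrics ship with scikit-learn, so the whole table can be scored in a few lines. The sketch below is illustrative only: the synthetic 90/10 imbalanced dataset and the logistic-regression model are assumptions, not part of the reference.

```python
# Illustrative only: synthetic imbalanced data and a simple model,
# scored with every classification metric from the table above.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score, roc_auc_score,
    average_precision_score, log_loss, brier_score_loss,
    matthews_corrcoef, cohen_kappa_score,
)

X, y = make_classification(n_samples=5_000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
y_pred = model.predict(X_test)               # hard labels for threshold-dependent metrics
y_prob = model.predict_proba(X_test)[:, 1]   # positive-class probabilities for the rest

print(f"Accuracy           {accuracy_score(y_test, y_pred):.3f}")
print(f"Balanced accuracy  {balanced_accuracy_score(y_test, y_pred):.3f}")
print(f"F1                 {f1_score(y_test, y_pred):.3f}")
print(f"ROC-AUC            {roc_auc_score(y_test, y_prob):.3f}")
print(f"Average precision  {average_precision_score(y_test, y_prob):.3f}")
print(f"Log loss           {log_loss(y_test, y_prob):.3f}")
print(f"Brier score        {brier_score_loss(y_test, y_prob):.3f}")
print(f"MCC                {matthews_corrcoef(y_test, y_pred):.3f}")
print(f"Cohen's kappa      {cohen_kappa_score(y_test, y_pred):.3f}")
```

Note how the inputs mirror the table: the threshold-free, probability-evaluating rows consume `y_prob`, while the threshold-dependent rows consume the hard labels in `y_pred`.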

Regression Metrics

| Metric | Scale-free | Outlier-robust | Directional | Primary use |
|---|---|---|---|---|
| MAE | – | ✓ | – | Intuitive average error |
| RMSE | – | – | – | Penalises large errors |
| R² | ✓ | – | – | Explained variance |
| Adjusted R² | ✓ | – | – | Variance adjusted for features |
| MAPE | ✓ | – | – | Business-friendly % error |
| WAPE | ✓ | – | – | Weighted % error |
| MASE | ✓ | – | – | Time-series comparison |
| MBE | – | – | ✓ | Bias detection |
| Median AE | – | ✓ | – | Regression with outliers |
| Pinball Loss | – | – | ✓ | Quantile forecast evaluation |
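
scikit-learn covers most of this table directly; WAPE, MBE, MASE, and adjusted R² have no built-in helper, so the sketch below computes them by hand from their standard formulas. The synthetic data and the feature count used for adjusted R² are assumptions for illustration.

```python
# Illustrative only: synthetic targets and noisy predictions, scored with
# the regression metrics from the table above.
import numpy as np
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    mean_absolute_percentage_error, median_absolute_error, mean_pinball_loss,
)

rng = np.random.default_rng(42)
y_true = rng.uniform(10, 100, size=500)
y_pred = y_true + rng.normal(0, 5, size=500)    # additive noise stands in for a model
n, p = len(y_true), 3                           # p: assumed number of model features

mae    = mean_absolute_error(y_true, y_pred)
rmse   = np.sqrt(mean_squared_error(y_true, y_pred))
r2     = r2_score(y_true, y_pred)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)   # penalises extra features
mape   = mean_absolute_percentage_error(y_true, y_pred)
wape   = np.abs(y_true - y_pred).sum() / np.abs(y_true).sum()
mase   = mae / mean_absolute_error(y_true[1:], y_true[:-1])  # vs. naive one-step baseline
mbe    = np.mean(y_pred - y_true)               # sign reveals over- or under-prediction
medae  = median_absolute_error(y_true, y_pred)
pinball = mean_pinball_loss(y_true, y_pred, alpha=0.9)  # loss for the 0.9 quantile

for name, value in [("MAE", mae), ("RMSE", rmse), ("R²", r2), ("Adj. R²", adj_r2),
                    ("MAPE", mape), ("WAPE", wape), ("MASE", mase), ("MBE", mbe),
                    ("Median AE", medae), ("Pinball@0.9", pinball)]:
    print(f"{name:12s} {value:9.3f}")
```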

Ranking & Distance Metrics

| Metric | Category | Rank-aware | Primary use |
|---|---|---|---|
| NDCG | Ranking | ✓ | Search & recommendation quality |
| MAP | Ranking | ✓ | Precision-based ranking |
| Recall@k | Ranking | – | Top-k coverage |
| Hit Rate | Ranking | – | Recommendation hit ratio |
| KL Divergence | Distance | – | Information-theoretic divergence |
| JS Divergence | Distance | – | Symmetric KL divergence |
| Wasserstein | Distance | – | Geometric distribution distance |
| Cosine Similarity | Distance | – | Vector direction similarity |
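
A representative subset can be computed with scikit-learn (NDCG) and SciPy (the distribution distances); Recall@k and hit rate are short enough to write by hand. All inputs below are toy data, and MAP is omitted for brevity.

```python
# Illustrative only: one ranked query plus two small discrete distributions.
import numpy as np
from scipy.spatial.distance import cosine, jensenshannon
from scipy.stats import entropy, wasserstein_distance
from sklearn.metrics import ndcg_score

# --- Ranking: graded relevance vs. model scores for a single query ---
true_relevance = np.asarray([[3, 2, 3, 0, 1, 2]])
model_scores   = np.asarray([[0.9, 0.8, 0.4, 0.3, 0.2, 0.1]])
print("NDCG@5:", ndcg_score(true_relevance, model_scores, k=5))

k = 3
ranked   = np.argsort(-model_scores[0])               # item indices, best score first
relevant = set(np.flatnonzero(true_relevance[0] > 0)) # items with nonzero relevance
top_k    = set(ranked[:k])
recall_at_k = len(top_k & relevant) / len(relevant)   # order within top-k is ignored
hit_rate    = float(len(top_k & relevant) > 0)        # 1 if any relevant item surfaces
print(f"Recall@{k}: {recall_at_k:.3f}  Hit rate: {hit_rate:.0f}")

# --- Distances between two discrete distributions ---
p = np.array([0.1, 0.4, 0.5])
q = np.array([0.2, 0.3, 0.5])
kl  = entropy(p, q)                  # KL(p || q), asymmetric
js  = jensenshannon(p, q) ** 2       # jensenshannon returns sqrt(JS divergence)
w1  = wasserstein_distance([0, 1, 2], [0, 1, 2], p, q)  # supports plus weights
cos = 1 - cosine(p, q)               # cosine similarity from SciPy's cosine distance
print(f"KL: {kl:.4f}  JS: {js:.4f}  Wasserstein: {w1:.4f}  Cosine sim: {cos:.4f}")
```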