4.3.11 Cohen's Kappa
Summary
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
- Prerequisite: Confusion Matrix. Understanding this concept first will make learning smoother.
1. Definition #
Let \(p_o\) be the observed agreement and \(p_e\) the expected agreement by random chance. The coefficient is
$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$

- \(\kappa = 1\): perfect agreement
- \(\kappa = 0\): no better than chance
- \(\kappa < 0\): worse than chance
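The definition can be verified by hand from a confusion matrix. The sketch below uses a hypothetical 2×2 matrix and computes \(p_o\) as the proportion on the diagonal and \(p_e\) from the row and column marginals:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true labels, columns = predictions
cm = np.array([[45, 5],
               [10, 40]])

n = cm.sum()
p_o = np.trace(cm) / n  # observed agreement: fraction on the diagonal
# chance agreement: sum over classes of (row marginal * column marginal) / n^2
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa = (p_o - p_e) / (1 - p_e)
print(kappa)  # -> 0.7 (p_o = 0.85, p_e = 0.5)
```

Here Accuracy alone would report 0.85, while κ discounts the 0.5 agreement expected by chance.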
2. Computing in Python 3.13 #
scikit-learn's cohen_kappa_score works for multi-class classification. Use weights="quadratic" to compute the weighted version for ordinal labels.
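A minimal sketch with scikit-learn, using hypothetical three-class labels; the same call computes the quadratic-weighted variant when the classes are ordinal:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ground-truth labels and model predictions (three classes)
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2, 0, 1]

kappa = cohen_kappa_score(y_true, y_pred)
print(round(kappa, 3))  # -> 0.697

# Weighted version: penalises distant disagreements more for ordinal labels
kappa_w = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```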
3. Interpretation guide #
Landis & Koch (1977) proposed the following rule of thumb. Adapt the thresholds to the expectations of your domain.
| κ | Interpretation |
|---|---|
| < 0 | Poor agreement |
| 0.00–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–1.00 | Almost perfect |
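For reporting, the table above can be turned into a small helper. The function name and band boundaries below follow the Landis & Koch thresholds; the helper itself is a hypothetical convenience, not a library API:

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to the Landis & Koch qualitative band."""
    if kappa < 0:
        return "Poor"
    bands = [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"),
             (0.8, "Substantial"), (1.0, "Almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"  # guard for floating-point values just above 1.0

print(landis_koch_label(0.7))  # -> Substantial
```

Remember the caveat from the text: these bands are a rule of thumb, so adjust them to your domain's expectations.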
4. Benefits for model evaluation #
- Robust to imbalance: Models that simply predict the majority class receive a low κ, counteracting overly optimistic Accuracy.
- Annotation quality checks: Compare model predictions with human labels or agreement between annotators objectively.
- Weighted Kappa: For ordinal outcomes (e.g., 5-point ratings) account for how far away incorrect predictions fall.
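The weighted-Kappa point can be illustrated with hypothetical 5-point ratings from two annotators whose disagreements are all near misses (adjacent categories). Quadratic weights penalise such near misses lightly, so the weighted κ comes out higher than the unweighted one:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point ratings from two annotators; all disagreements
# are off by a single rating step
rater_a = [1, 2, 3, 4, 5, 3, 2, 4]
rater_b = [1, 2, 4, 4, 5, 2, 2, 5]

unweighted = cohen_kappa_score(rater_a, rater_b)
quadratic = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
# Near-miss disagreements are penalised less under quadratic weights,
# so quadratic > unweighted for this data
```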
5. Practical tips #
- A high Accuracy but low κ signals that the model may rely on chance agreement. Inspect the confusion matrix for failure patterns.
- Regulated industries sometimes require κ-based reporting—document the computation pipeline for audits.
- Use κ when auditing training labels to identify annotators or subsets with inconsistent decisions.
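The first tip, high Accuracy with low κ, is easy to reproduce with a hypothetical imbalanced dataset and a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced labels: 95% negative, 5% positive
y_true = [0] * 95 + [1] * 5
y_majority = [0] * 100  # a model that always predicts the majority class

acc = accuracy_score(y_true, y_majority)       # 0.95: looks strong
kappa = cohen_kappa_score(y_true, y_majority)  # 0.0: no better than chance
```

Here p_o and p_e are both 0.95, so κ collapses to 0 even though Accuracy is 0.95, which is exactly the failure mode the confusion matrix inspection would reveal.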
Key takeaways #
- Cohen’s Kappa subtracts chance agreement, making it suitable for imbalanced problems and annotation benchmarking.
- cohen_kappa_score in scikit-learn provides both standard and weighted versions with minimal code.
- Combine κ with Accuracy, F1, and other metrics for a well-rounded assessment of model performance and labeling quality.