4.3.11 Cohen's Kappa
Summary
- Understand the fundamentals of this metric, what it evaluates, and how to interpret the results.
- Compute and visualise the metric with Python 3.13 code examples, covering key steps and practical checkpoints.
- Combine charts and complementary metrics for effective model comparison and threshold tuning.
- Prerequisite: Confusion Matrix. Understanding this concept first will make learning smoother.
1. Definition #
Let \(p_o\) be the observed agreement and \(p_e\) the expected agreement by random chance. The coefficient is
$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$

- \(\kappa = 1\): perfect agreement
- \(\kappa = 0\): no better than chance
- \(\kappa < 0\): worse than chance
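The definition can be verified by hand from a confusion matrix. The sketch below uses a hypothetical 2×2 matrix and computes \(p_o\) as the proportion on the diagonal and \(p_e\) from the row and column marginals:

```python
import numpy as np

# Hypothetical 2x2 confusion matrix: rows = true labels, columns = predictions
cm = np.array([[45, 5],
               [10, 40]])

n = cm.sum()
p_o = np.trace(cm) / n  # observed agreement: fraction on the diagonal
# chance agreement: sum over classes of (row marginal * column marginal) / n^2
p_e = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa = (p_o - p_e) / (1 - p_e)
print(kappa)  # -> 0.7 (p_o = 0.85, p_e = 0.5)
```

Here Accuracy alone would report 0.85, while κ discounts the 0.5 agreement expected by chance.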
2. Computing in Python 3.13 #
scikit-learn's cohen_kappa_score works for multi-class classification. Use weights="quadratic" to compute the weighted version for ordinal labels.
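A minimal sketch with scikit-learn, using hypothetical three-class labels; the same call computes the quadratic-weighted variant when the classes are ordinal:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical ground-truth labels and model predictions (three classes)
y_true = [0, 1, 2, 2, 1, 0, 1, 2, 0, 1]
y_pred = [0, 1, 2, 1, 1, 0, 2, 2, 0, 1]

kappa = cohen_kappa_score(y_true, y_pred)
print(round(kappa, 3))  # -> 0.697

# Weighted version: penalises distant disagreements more for ordinal labels
kappa_w = cohen_kappa_score(y_true, y_pred, weights="quadratic")
```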
3. Interpretation guide #
Landis & Koch (1977) proposed the following rule of thumb. Adapt the thresholds to the expectations of your domain.
| κ | Interpretation |
|---|---|
| < 0 | Poor agreement |
| 0.00–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–1.00 | Almost perfect |
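For reporting, the table above can be turned into a small helper. The function name and band boundaries below follow the Landis & Koch thresholds; the helper itself is a hypothetical convenience, not a library API:

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to the Landis & Koch qualitative band."""
    if kappa < 0:
        return "Poor"
    bands = [(0.2, "Slight"), (0.4, "Fair"), (0.6, "Moderate"),
             (0.8, "Substantial"), (1.0, "Almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"  # guard for floating-point values just above 1.0

print(landis_koch_label(0.7))  # -> Substantial
```

Remember the caveat from the text: these bands are a rule of thumb, so adjust them to your domain's expectations.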
4. Benefits for model evaluation #
- Robust to imbalance: Models that simply predict the majority class receive a low κ, counteracting overly optimistic Accuracy.
- Annotation quality checks: Compare model predictions with human labels or agreement between annotators objectively.
- Weighted Kappa: For ordinal outcomes (e.g., 5-point ratings) account for how far away incorrect predictions fall.
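The weighted-Kappa point can be illustrated with hypothetical 5-point ratings from two annotators whose disagreements are all near misses (adjacent categories). Quadratic weights penalise such near misses lightly, so the weighted κ comes out higher than the unweighted one:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point ratings from two annotators; all disagreements
# are off by a single rating step
rater_a = [1, 2, 3, 4, 5, 3, 2, 4]
rater_b = [1, 2, 4, 4, 5, 2, 2, 5]

unweighted = cohen_kappa_score(rater_a, rater_b)
quadratic = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
# Near-miss disagreements are penalised less under quadratic weights,
# so quadratic > unweighted for this data
```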
5. Practical tips #
- A high Accuracy but low κ signals that the model may rely on chance agreement. Inspect the confusion matrix for failure patterns.
- Regulated industries sometimes require κ-based reporting—document the computation pipeline for audits.
- Use κ when auditing training labels to identify annotators or subsets with inconsistent decisions.
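The first tip, high Accuracy with low κ, is easy to reproduce with a hypothetical imbalanced dataset and a degenerate model that always predicts the majority class:

```python
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Hypothetical imbalanced labels: 95% negative, 5% positive
y_true = [0] * 95 + [1] * 5
y_majority = [0] * 100  # a model that always predicts the majority class

acc = accuracy_score(y_true, y_majority)       # 0.95: looks strong
kappa = cohen_kappa_score(y_true, y_majority)  # 0.0: no better than chance
```

Here p_o and p_e are both 0.95, so κ collapses to 0 even though Accuracy is 0.95, which is exactly the failure mode the confusion matrix inspection would reveal.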
Key takeaways #
- Cohen’s Kappa subtracts chance agreement, making it suitable for imbalanced problems and annotation benchmarking.
- cohen_kappa_score in scikit-learn provides both standard and weighted versions with minimal code.
- Combine κ with Accuracy, F1, and other metrics for a well-rounded assessment of model performance and labeling quality.