4.4.5 Hellinger Distance
Summary
- The Hellinger distance measures the difference between probability distributions based on their square roots, ranging from 0 to 1.
- We show the calculation for discrete distributions and compare it to KL divergence.
- Practical notes on handling zeros and normalization are discussed.
1. Definition and Properties #
For discrete distributions \(P = (p_1, \dots, p_n)\) and \(Q = (q_1, \dots, q_n)\), the Hellinger distance is defined as:
$$ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum_{i=1}^n \left(\sqrt{p_i} - \sqrt{q_i}\right)^2 } $$
- Values near 0 indicate similar distributions; values near 1 indicate dissimilar ones.
- It satisfies symmetry and the triangle inequality, making it a true metric.
- Unlike KL divergence, it remains finite even when supports do not overlap.
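As a quick sanity check of the definition, take \(P = (1, 0)\) and \(Q = (0.5, 0.5)\):

$$ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \left(1 - \sqrt{0.5}\right)^2 + \left(0 - \sqrt{0.5}\right)^2 } = \frac{1}{\sqrt{2}} \sqrt{0.0858 + 0.5} \approx 0.54 $$

The result sits partway between 0 (identical distributions) and 1 (disjoint supports), as expected for a certain outcome compared against a uniform one.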
2. Computing in Python #
When using histograms, normalize so that values sum to 1 before computing the square-root difference.
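A minimal sketch of the computation with NumPy, following the definition above (the function name `hellinger` and the example counts are illustrative, not from any particular library):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (1/sqrt(2) convention)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so each input sums to 1 (handles raw histogram counts)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Raw histogram counts; normalization happens inside the function
counts_a = [10, 30, 60]
counts_b = [20, 30, 50]
print(hellinger(counts_a, counts_b))
```

Because the normalization is done inside the function, raw bin counts can be passed directly without pre-processing.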
3. Key Characteristics #
- Symmetric and metric: Like the Jensen–Shannon distance, it satisfies symmetry and the triangle inequality.
- Finite with zeros: Always yields a finite value even when distributions contain zeros.
- Square-root scaling: Reduces overemphasis on small probabilities, improving stability with outliers.
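The finiteness property is easy to demonstrate side by side with KL divergence: when \(Q\) assigns zero probability to an outcome that \(P\) supports, KL divergence is infinite while the Hellinger distance stays bounded. A small sketch (the helper functions are illustrative):

```python
import numpy as np

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def kl(p, q):
    """KL divergence D(P || Q); terms with p_i = 0 contribute 0 by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide="ignore"):
        return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]  # no mass where p has mass, and vice versa
print(hellinger(p, q))  # finite (about 0.707)
print(kl(p, q))         # inf: q is zero where p is positive
```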
4. Practical Applications #
- Bayesian inference: Quantify the difference between prior and posterior distributions.
- Clustering: Measure distances between probabilistic vectors (e.g., topic distributions).
- Concept drift detection: Monitor time-evolving distributions and raise alerts when exceeding thresholds.
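The drift-detection use case above can be sketched as follows: histogram a reference window, histogram each incoming batch on the same bins, and alert when the Hellinger distance exceeds a threshold. The threshold value, window sizes, and binning here are illustrative assumptions, not recommendations:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

rng = np.random.default_rng(0)
bins = np.linspace(-4, 4, 21)  # shared bin edges for all windows

# Reference window (e.g. training-time feature values)
ref_hist, _ = np.histogram(rng.normal(0, 1, 5000), bins=bins)
ref_p = ref_hist / ref_hist.sum()

THRESHOLD = 0.1  # alert threshold; application-specific (assumption)
distances = {}
for shift in [0.0, 0.5, 1.0]:  # simulate batches with growing mean drift
    batch_hist, _ = np.histogram(rng.normal(shift, 1, 5000), bins=bins)
    d = hellinger(ref_p, batch_hist / batch_hist.sum())
    distances[shift] = d
    print(f"mean shift {shift:.1f}: H = {d:.3f}, drift = {d > THRESHOLD}")
```

An undrifted batch yields only the small sampling-noise distance, while a shifted batch pushes the distance past the threshold.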
5. Practical Considerations #
- The distance depends on histogram binning; design the discretization carefully.
- Apply Laplace smoothing when sample sizes are small to stabilize results.
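Laplace smoothing, as suggested above, can be sketched by adding a constant to every bin before normalizing; with small samples this damps the influence of empty bins (the function names and counts are illustrative):

```python
import numpy as np

def smooth(counts, alpha=1.0):
    """Laplace (additive) smoothing: add alpha to every bin, then normalize."""
    c = np.asarray(counts, dtype=float) + alpha
    return c / c.sum()

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Small samples with an empty bin: smoothing pulls the estimate
# away from the unstable extremes of the raw histogram
raw_a = [5, 0, 1]
raw_b = [3, 2, 1]
print(hellinger(smooth(raw_a, 0), smooth(raw_b, 0)))  # unsmoothed (alpha=0)
print(hellinger(smooth(raw_a), smooth(raw_b)))        # smoothed: smaller distance
```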
- Combine with other metrics (KL, Wasserstein, etc.) for a multidimensional understanding of distributional differences.
Summary #
The Hellinger distance measures distributional differences using square-root scaling, producing intuitive values between 0 and 1.
It is robust to zeros and widely used in Bayesian analysis, clustering, and distribution monitoring tasks.