Overview
- The Hellinger distance measures the difference between probability distributions based on their square roots, ranging from 0 to 1.
- We show the calculation for discrete distributions and compare it to KL divergence.
- Practical notes on handling zeros and normalization are discussed.
1. Definition and Properties #
For discrete distributions \(P = (p_1, \dots, p_n)\) and \(Q = (q_1, \dots, q_n)\), the Hellinger distance is defined as:
$$ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum_{i=1}^n \left(\sqrt{p_i} - \sqrt{q_i}\right)^2 } $$
- Values near 0 indicate similar distributions; values near 1 indicate dissimilar ones.
- It satisfies symmetry and the triangle inequality, making it a true metric.
- Unlike KL divergence, it remains finite even when supports do not overlap.
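As a quick worked example, take the same distributions used in the code below, \(P = (0.4, 0.4, 0.2)\) and \(Q = (0.2, 0.5, 0.3)\):
$$ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \left(\sqrt{0.4} - \sqrt{0.2}\right)^2 + \left(\sqrt{0.4} - \sqrt{0.5}\right)^2 + \left(\sqrt{0.2} - \sqrt{0.3}\right)^2 } \approx 0.158 $$
The two distributions overlap heavily, so the distance sits close to 0.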
2. Computing in Python #
import numpy as np

def hellinger(p: np.ndarray, q: np.ndarray) -> float:
    """Compute the Hellinger distance between two probability distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so each input sums to 1 (this also allows raw counts as input)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2))

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.2, 0.5, 0.3])
print(f"Hellinger distance: {hellinger(p, q):.4f}")  # 0.1581
When using histograms, normalize so that values sum to 1 before computing the square-root difference.
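As a minimal sketch of the histogram case, the snippet below bins two raw samples on shared edges and passes the counts to the hellinger() function defined above (which renormalizes them); the generated samples, the random seed, and the choice of 20 bins are illustrative assumptions.

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)   # reference sample
y = rng.normal(0.5, 1.2, size=1000)   # sample to compare

# Shared bin edges so both histograms are defined on the same grid
bins = np.histogram_bin_edges(np.concatenate([x, y]), bins=20)
hx, _ = np.histogram(x, bins=bins)
hy, _ = np.histogram(y, bins=bins)

print(f"Hellinger distance between histograms: {hellinger(hx, hy):.4f}")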
3. Key Characteristics #
- Symmetric and metric: Like the Jensen–Shannon distance, it satisfies symmetry and the triangle inequality.
- Finite with zeros: Always yields a finite value even when distributions contain zeros (see the comparison with KL divergence after this list).
- Square-root scaling: Unlike ratio-based measures such as KL divergence, the square-root form does not blow up on near-zero probabilities, making the distance more stable in the presence of outlier bins.
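To make the zero-handling point concrete, here is a small check on two distributions with disjoint supports; using scipy.stats.entropy for the KL divergence is an assumption on top of the article's NumPy-only code.

from scipy.stats import entropy

p = np.array([1.0, 0.0])
q = np.array([0.0, 1.0])

print(entropy(p, q))    # inf: KL divergence diverges when q is 0 where p > 0
print(hellinger(p, q))  # 1.0: Hellinger distance stays finite at its maximum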
4. Practical Applications #
- Bayesian inference: Quantify the difference between prior and posterior distributions.
- Clustering: Measure distances between probabilistic vectors (e.g., topic distributions).
- Concept drift detection: Monitor time-evolving distributions and raise alerts when the distance exceeds a threshold (a sketch follows this list).
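As a rough sketch of the drift-detection use case, one possible wiring is to compare each incoming batch against a reference window and alert above a threshold; the function name detect_drift, the shared bin edges, and the 0.2 threshold are hypothetical choices rather than recommendations.

def detect_drift(reference: np.ndarray, batch: np.ndarray,
                 bins: np.ndarray, threshold: float = 0.2) -> bool:
    """Return True when the Hellinger distance between histograms exceeds the threshold."""
    ref_hist, _ = np.histogram(reference, bins=bins)
    new_hist, _ = np.histogram(batch, bins=bins)
    return hellinger(ref_hist, new_hist) > threshold

# Example: detect_drift(last_week_scores, today_scores, bins=np.linspace(0.0, 1.0, 21))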
5. Practical Considerations #
- The distance depends on histogram binning; design the discretization carefully.
- Apply Laplace smoothing when sample sizes are small to stabilize the estimate (a minimal sketch follows this list).
- Combine with other metrics (KL, Wasserstein, etc.) for a multidimensional understanding of distributional differences.
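A minimal sketch of the smoothing step, assuming raw count vectors and an illustrative pseudo-count of alpha = 1.0 (the appropriate value depends on the data):

def smoothed_hellinger(counts_p: np.ndarray, counts_q: np.ndarray,
                       alpha: float = 1.0) -> float:
    """Apply Laplace (additive) smoothing to both count vectors before measuring distance."""
    p = np.asarray(counts_p, dtype=float) + alpha  # add a pseudo-count to every bin
    q = np.asarray(counts_q, dtype=float) + alpha
    return hellinger(p, q)  # hellinger() renormalizes internally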
Summary #
The Hellinger distance measures distributional differences using square-root scaling, producing intuitive values between 0 and 1.
It is robust to zeros and widely used in Bayesian analysis, clustering, and distribution monitoring tasks.