4.4.5 Hellinger Distance
Summary
- The Hellinger distance measures the difference between probability distributions based on their square roots, ranging from 0 to 1.
- We show the calculation for discrete distributions and compare it to KL divergence.
- Practical notes on handling zeros and normalization are discussed.
1. Definition and Properties #
For discrete distributions \(P = (p_1, \dots, p_n)\) and \(Q = (q_1, \dots, q_n)\), the Hellinger distance is defined as:
$$ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \sum_{i=1}^n \left(\sqrt{p_i} - \sqrt{q_i}\right)^2 } $$
- Values near 0 indicate similar distributions; values near 1 indicate dissimilar ones.
- It satisfies symmetry and the triangle inequality, making it a true metric.
- Unlike KL divergence, it remains finite even when supports do not overlap.
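As a quick sanity check of the definition, take \(P = (1, 0)\) and \(Q = (0.5, 0.5)\):

$$ H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{ \left(1 - \sqrt{0.5}\right)^2 + \left(0 - \sqrt{0.5}\right)^2 } = \frac{1}{\sqrt{2}} \sqrt{0.0858 + 0.5} \approx 0.54 $$

The result sits partway between 0 (identical distributions) and 1 (disjoint supports), as expected for a certain outcome compared against a uniform one.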
2. Computing in Python #
When using histograms, normalize so that values sum to 1 before computing the square-root difference.
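A minimal sketch of the computation with NumPy, following the definition above (the function name `hellinger` and the example counts are illustrative, not from any particular library):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (1/sqrt(2) convention)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so each input sums to 1 (handles raw histogram counts)
    p = p / p.sum()
    q = q / q.sum()
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Raw histogram counts; normalization happens inside the function
counts_a = [10, 30, 60]
counts_b = [20, 30, 50]
print(hellinger(counts_a, counts_b))
```

Because the normalization is done inside the function, raw bin counts can be passed directly without pre-processing.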
3. Key Characteristics #
- Symmetric and metric: Like the Jensen–Shannon distance, it satisfies symmetry and the triangle inequality.
- Finite with zeros: Always yields a finite value even when distributions contain zeros.
- Square-root scaling: Reduces overemphasis on small probabilities, improving stability with outliers.
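The finiteness property is easy to demonstrate side by side with KL divergence: when \(Q\) assigns zero probability to an outcome that \(P\) supports, KL divergence is infinite while the Hellinger distance stays bounded. A small sketch (the helper functions are illustrative):

```python
import numpy as np

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def kl(p, q):
    """KL divergence D(P || Q); terms with p_i = 0 contribute 0 by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide="ignore"):
        return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]  # no mass where p has mass, and vice versa
print(hellinger(p, q))  # finite (about 0.707)
print(kl(p, q))         # inf: q is zero where p is positive
```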
4. Practical Applications #
- Bayesian inference: Quantify the difference between prior and posterior distributions.
- Clustering: Measure distances between probabilistic vectors (e.g., topic distributions).
- Concept drift detection: Monitor time-evolving distributions and raise alerts when exceeding thresholds.
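The drift-detection use case above can be sketched as follows: histogram a reference window, histogram each incoming batch on the same bins, and alert when the Hellinger distance exceeds a threshold. The threshold value, window sizes, and binning here are illustrative assumptions, not recommendations:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

rng = np.random.default_rng(0)
bins = np.linspace(-4, 4, 21)  # shared bin edges for all windows

# Reference window (e.g. training-time feature values)
ref_hist, _ = np.histogram(rng.normal(0, 1, 5000), bins=bins)
ref_p = ref_hist / ref_hist.sum()

THRESHOLD = 0.1  # alert threshold; application-specific (assumption)
distances = {}
for shift in [0.0, 0.5, 1.0]:  # simulate batches with growing mean drift
    batch_hist, _ = np.histogram(rng.normal(shift, 1, 5000), bins=bins)
    d = hellinger(ref_p, batch_hist / batch_hist.sum())
    distances[shift] = d
    print(f"mean shift {shift:.1f}: H = {d:.3f}, drift = {d > THRESHOLD}")
```

An undrifted batch yields only the small sampling-noise distance, while a shifted batch pushes the distance past the threshold.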
5. Practical Considerations #
- The distance depends on histogram binning; design the discretization carefully.
- Apply Laplace smoothing when sample sizes are small to stabilize results.
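Laplace smoothing, as suggested above, can be sketched by adding a constant to every bin before normalizing; with small samples this damps the influence of empty bins (the function names and counts are illustrative):

```python
import numpy as np

def smooth(counts, alpha=1.0):
    """Laplace (additive) smoothing: add alpha to every bin, then normalize."""
    c = np.asarray(counts, dtype=float) + alpha
    return c / c.sum()

def hellinger(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

# Small samples with an empty bin: smoothing pulls the estimate
# away from the unstable extremes of the raw histogram
raw_a = [5, 0, 1]
raw_b = [3, 2, 1]
print(hellinger(smooth(raw_a, 0), smooth(raw_b, 0)))  # unsmoothed (alpha=0)
print(hellinger(smooth(raw_a), smooth(raw_b)))        # smoothed: smaller distance
```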
- Combine with other metrics (KL, Wasserstein, etc.) for a multidimensional understanding of distributional differences.
Summary #
The Hellinger distance measures distributional differences using square-root scaling, producing intuitive values between 0 and 1.
It is robust to zeros and widely used in Bayesian analysis, clustering, and distribution monitoring tasks.