Wasserstein Distance (Earth Mover’s Distance)

Overview
  • The Wasserstein distance measures the cost of moving probability mass to transform one distribution into another.
  • We demonstrate the computation in 1D and introduce common approximations for multidimensional cases.
  • Practical notes on normalization, computation cost, and parameter tuning are discussed.

1. Definition and Intuition #

For one-dimensional discrete distributions \(P\) and \(Q\), the 1-Wasserstein distance can be written using their cumulative distribution functions (CDFs):

$$ W_1(P, Q) = \int_{-\infty}^{\infty} |F_P(x) - F_Q(x)| \, dx $$

In higher dimensions, it is formulated as an optimal transport problem — finding the minimum cost required to move one distribution’s mass to match another.
Unlike simple mean-based metrics, it reflects both differences in location and shape.
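
The CDF formula above can be sketched directly for empirical distributions. Assuming equal-size samples, the integral reduces to the mean absolute difference of the sorted samples (order statistics); `wasserstein_1d` is a hypothetical helper name:

```python
import numpy as np

def wasserstein_1d(p_samples, q_samples):
    # For equal-size empirical distributions, W1 = (1/n) * sum |x_(i) - y_(i)|,
    # the mean absolute difference between matched order statistics.
    p = np.sort(p_samples)
    q = np.sort(q_samples)
    return np.mean(np.abs(p - q))

# All mass at 0 vs. all mass at 1: every unit of mass moves distance 1.
print(wasserstein_1d(np.zeros(100), np.ones(100)))  # → 1.0
```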


2. Computing in Python #

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)        # seed for reproducibility
x = rng.normal(0, 1, size=1_000)      # samples from N(0, 1)
y = rng.normal(1, 1.5, size=1_000)    # samples from N(1, 1.5)

dist = wasserstein_distance(x, y)

print(f"Wasserstein distance: {dist:.3f}")

scipy.stats.wasserstein_distance computes the 1D Wasserstein distance.
For multidimensional data, approximate methods like the Sinkhorn distance from the POT (Python Optimal Transport) library are commonly used in practice.


3. Key Characteristics #

  • Sensitive to distribution shape: Even if means are equal, differences in variance or multimodality increase the distance.
  • Robustness: Unlike KL divergence, it remains finite when supports don’t overlap and is less sensitive to outliers.
  • Computation cost: Becomes expensive in high dimensions — regularized methods such as Sinkhorn iterations help speed it up.
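
The shape sensitivity in the first bullet is easy to verify: two samples with the same mean but different variances still yield a clearly nonzero distance.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
narrow = rng.normal(0, 1, size=10_000)  # N(0, 1)
wide = rng.normal(0, 3, size=10_000)    # N(0, 3): same mean, wider spread

# The means nearly coincide, but the distance reflects the spread difference.
print(abs(narrow.mean() - wide.mean()))    # near 0
print(wasserstein_distance(narrow, wide))  # clearly > 0
```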

4. Practical Applications #

  • Generative model evaluation: Quantify the overall difference between generated and real data distributions.
  • Quality inspection & simulation: Capture differences in full histograms rather than just summary statistics.
  • Time-series monitoring: Track distribution shifts across periods and trigger alerts when thresholds are exceeded.

5. Practical Considerations #

  • Because the metric scales with data magnitude, standardizing or normalizing inputs improves interpretability.
  • Small Wasserstein distance does not necessarily imply identical mean or variance — use in combination with other metrics.
  • Sinkhorn regularization parameters influence results; tune them based on comparison goals.
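
The first point above can be illustrated by standardizing with the reference sample's statistics before comparing, so the distance reads in standard deviations rather than raw units; the values here are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(2)
# Same relative shift, but the raw values live on a large scale.
a = rng.normal(100.0, 10.0, size=5_000)
b = rng.normal(110.0, 10.0, size=5_000)

raw = wasserstein_distance(a, b)

# Standardize both samples using the reference statistics of `a`.
mu, sigma = a.mean(), a.std()
scaled = wasserstein_distance((a - mu) / sigma, (b - mu) / sigma)

print(raw)     # on the order of 10 (raw units)
print(scaled)  # on the order of 1 (standard deviations)
```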

Summary #

The Wasserstein distance measures distributional differences by accounting for both position and shape.
It is easy to compute in 1D and can be approximated efficiently in higher dimensions.
Combining it with KL or JS divergence provides a more comprehensive view of distributional shifts.