Jensen–Shannon divergence

Summary
  • Jensen–Shannon divergence symmetrises KL divergence and keeps the value finite.
  • Compute JSD and its square root (the Jensen–Shannon distance) in Python.
  • Apply it to clustering, generative-model evaluation, and drift analysis.

1. Definition and properties #

Given two distributions \(P\) and \(Q\), let \(M = \frac{1}{2}(P + Q)\). The Jensen–Shannon divergence is:

$$ \mathrm{JSD}(P \parallel Q) = \frac{1}{2} \mathrm{KL}(P \parallel M) + \frac{1}{2} \mathrm{KL}(Q \parallel M) $$

  • Symmetric: \( \mathrm{JSD}(P \parallel Q) = \mathrm{JSD}(Q \parallel P) \).
  • Bounded: \( 0 \le \mathrm{JSD} \le 1 \) with log base 2 (or \( \le \ln 2 \) with the natural log).
  • The square root of JSD is a proper metric (it satisfies the triangle inequality).
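As a sanity check, the definition above can be implemented directly in a few lines. This sketch assumes `p` and `q` are already valid probability vectors (non-negative, summing to 1):

```python
import numpy as np

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence computed directly from the definition."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.1, 0.3, 0.6])

print(jsd(p, q))  # equals jsd(q, p), and lies in [0, 1] for base 2
```

Comparing the result against `jensenshannon(p, q, base=2) ** 2` from SciPy should agree up to floating-point error.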

2. Python example #

import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.1, 0.3, 0.6])

js_distance = jensenshannon(p, q, base=2)
js_divergence = js_distance ** 2  # square the distance to obtain divergence

print(f"Jensen-Shannon distance : {js_distance:.4f}")
print(f"Jensen-Shannon divergence: {js_divergence:.4f}")

jensenshannon returns the Jensen–Shannon distance (the square root of the divergence) and normalises its inputs to sum to 1; square the result if you need the divergence itself.


3. Characteristics and use cases #

  • Symmetry and stability: the result does not depend on argument order, and it stays finite even when one distribution assigns zero probability where the other does not, because the mixture \(M\) covers both supports.
  • Bounded: values stay within a predictable range, making thresholding easier.
  • Metric: the Jensen–Shannon distance satisfies the triangle inequality, so it can be used with clustering algorithms that require a true metric.
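For instance, the metric property means SciPy's pdist can compute pairwise Jensen–Shannon distances that feed straight into hierarchical clustering. The rows below are made-up distributions for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical rows: each row is a probability distribution (e.g. topic mixtures).
X = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

# Condensed matrix of pairwise Jensen-Shannon distances (a proper metric).
d = pdist(X, metric="jensenshannon")

# Average-linkage hierarchical clustering, cut into two clusters.
labels = fcluster(linkage(d, method="average"), t=2, criterion="maxclust")
print(labels)  # the first two rows share a label, as do the last two
```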

4. Practical examples #

  • Generative models: measure divergence between generated and real distributions.
  • Language/topic models: compare probability distributions of words or topics.
  • Anomaly detection: monitor distribution shifts in time series or streaming data.
  • Model selection: pick the candidate whose output distribution best matches ground truth.
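The model-selection case can be sketched directly: score each candidate's output distribution against the ground truth and keep the smallest divergence. The candidate names and distributions here are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

truth = np.array([0.5, 0.3, 0.2])
candidates = {
    "model_a": np.array([0.6, 0.25, 0.15]),
    "model_b": np.array([0.3, 0.3, 0.4]),
}

# Squared distance = divergence; lower means closer to the ground truth.
scores = {name: jensenshannon(truth, dist, base=2) ** 2
          for name, dist in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```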

5. Caveats #

  • For continuous data, discretise via binning or use density estimation before computing JSD.
  • Apply smoothing when many zero probabilities are present.
  • Small divergence does not guarantee superior model performance; interpret alongside other metrics.
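The first two caveats can be handled together: discretise both samples on a shared set of bins, then apply additive smoothing before normalising. A minimal sketch with synthetic Gaussian windows (the bin count and epsilon are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=10_000)  # reference window
current = rng.normal(0.5, 1.0, size=10_000)   # drifted window

# Discretise both samples on a shared set of bin edges.
edges = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=30)
p, _ = np.histogram(baseline, bins=edges)
q, _ = np.histogram(current, bins=edges)

# Additive smoothing removes zero-count bins, then normalise.
eps = 1e-6
p = (p + eps) / (p + eps).sum()
q = (q + eps) / (q + eps).sum()

drift = jensenshannon(p, q, base=2) ** 2  # divergence in [0, 1]
print(f"JSD between windows: {drift:.4f}")
```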

Jensen–Shannon divergence provides a stable, symmetric alternative to KL with convenient metric properties. SciPy makes it easy to compute, enabling broad use in monitoring, evaluation, and clustering tasks.