Jensen–Shannon divergence

Summary
  • Jensen–Shannon divergence symmetrises KL divergence and keeps the value finite.
  • Compute JSD and its square root (the Jensen–Shannon distance) in Python.
  • Apply it to clustering, generative-model evaluation, and drift analysis.

1. Definition and properties #

Given two distributions \(P\) and \(Q\), let \(M = \frac{1}{2}(P + Q)\). The Jensen–Shannon divergence is:

$$ \mathrm{JSD}(P \parallel Q) = \frac{1}{2} \mathrm{KL}(P \parallel M) + \frac{1}{2} \mathrm{KL}(Q \parallel M) $$

  • Symmetric: \( \mathrm{JSD}(P \parallel Q) = \mathrm{JSD}(Q \parallel P) \).
  • Bounded: \( 0 \le \mathrm{JSD} \le 1 \) with log base 2 (or \( \le \ln 2 \) with the natural log).
  • The square root of JSD is a proper metric (it satisfies the triangle inequality).
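As a sanity check, the definition above can be implemented directly in a few lines. This sketch assumes `p` and `q` are already valid probability vectors (non-negative, summing to 1):

```python
import numpy as np

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence computed directly from the definition."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask])) / np.log(base)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.1, 0.3, 0.6])

print(jsd(p, q))  # equals jsd(q, p), and lies in [0, 1] for base 2
```

Comparing the result against `jensenshannon(p, q, base=2) ** 2` from SciPy should agree up to floating-point error.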

2. Python example #

import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.4, 0.4, 0.2])
q = np.array([0.1, 0.3, 0.6])

js_distance = jensenshannon(p, q, base=2)
js_divergence = js_distance ** 2  # square the distance to obtain divergence

print(f"Jensen-Shannon distance : {js_distance:.4f}")
print(f"Jensen-Shannon divergence: {js_divergence:.4f}")

jensenshannon returns the Jensen–Shannon distance (the square root of the divergence) and normalises its inputs to sum to 1; square the result if you need the divergence itself.


3. Characteristics and use cases #

  • Symmetry and stability: the result does not depend on argument order, and it stays finite even when one distribution assigns zero probability where the other does not, because the mixture \(M\) covers both supports.
  • Bounded: values stay within a predictable range, making thresholding easier.
  • Metric: the Jensen–Shannon distance satisfies the triangle inequality, so it can be used with clustering algorithms that require a true metric.
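For instance, the metric property means SciPy's pdist can compute pairwise Jensen–Shannon distances that feed straight into hierarchical clustering. The rows below are made-up distributions for illustration:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical rows: each row is a probability distribution (e.g. topic mixtures).
X = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.1, 0.2, 0.7],
])

# Condensed matrix of pairwise Jensen-Shannon distances (a proper metric).
d = pdist(X, metric="jensenshannon")

# Average-linkage hierarchical clustering, cut into two clusters.
labels = fcluster(linkage(d, method="average"), t=2, criterion="maxclust")
print(labels)  # the first two rows share a label, as do the last two
```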

4. Practical examples #

  • Generative models: measure divergence between generated and real distributions.
  • Language/topic models: compare probability distributions of words or topics.
  • Anomaly detection: monitor distribution shifts in time series or streaming data.
  • Model selection: pick the candidate whose output distribution best matches ground truth.
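The model-selection case can be sketched directly: score each candidate's output distribution against the ground truth and keep the smallest divergence. The candidate names and distributions here are invented for illustration:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

truth = np.array([0.5, 0.3, 0.2])
candidates = {
    "model_a": np.array([0.6, 0.25, 0.15]),
    "model_b": np.array([0.3, 0.3, 0.4]),
}

# Squared distance = divergence; lower means closer to the ground truth.
scores = {name: jensenshannon(truth, dist, base=2) ** 2
          for name, dist in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```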

5. Caveats #

  • For continuous data, discretise via binning or use density estimation before computing JSD.
  • Apply smoothing when many zero probabilities are present.
  • Small divergence does not guarantee superior model performance; interpret alongside other metrics.
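The first two caveats can be handled together: discretise both samples on a shared set of bins, then apply additive smoothing before normalising. A minimal sketch with synthetic Gaussian windows (the bin count and epsilon are illustrative choices):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, size=10_000)  # reference window
current = rng.normal(0.5, 1.0, size=10_000)   # drifted window

# Discretise both samples on a shared set of bin edges.
edges = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=30)
p, _ = np.histogram(baseline, bins=edges)
q, _ = np.histogram(current, bins=edges)

# Additive smoothing removes zero-count bins, then normalise.
eps = 1e-6
p = (p + eps) / (p + eps).sum()
q = (q + eps) / (q + eps).sum()

drift = jensenshannon(p, q, base=2) ** 2  # divergence in [0, 1]
print(f"JSD between windows: {drift:.4f}")
```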

Jensen–Shannon divergence provides a stable, symmetric alternative to KL with convenient metric properties. SciPy makes it easy to compute, enabling broad use in monitoring, evaluation, and clustering tasks.