
f-Divergences

General f-divergence framework and named convenience functions for the most common special cases.

f_divergence(sample_p, sample_q, f, *, discrete=False)

Compute a general f-divergence D_f(P || Q).

The f-divergence of P from Q is defined as

D_f(P || Q) = E_Q[f(dP/dQ)] = integral q(x) f(p(x)/q(x)) dx

where f is a convex function with f(1) = 0.

Parameters:

  • sample_p (ndarray, required): Sample from distribution P.
  • sample_q (ndarray, required): Sample from distribution Q.
  • f (callable, required): Convex generator function with f(1) = 0. Must accept and return np.ndarray (vectorized).
  • discrete (bool, default False): If True, treat samples as discrete categories. Otherwise, estimate densities via kernel density estimation.

Returns:

  • float: The estimated f-divergence D_f(P || Q).

Raises:

  • ValueError: If discrete=True and P has positive mass where Q has zero mass (P is not absolutely continuous with respect to Q).

Notes

Different choices of f yield well-known divergences:

  • f(t) = t log(t): KL divergence
  • f(t) = 0.5 |t - 1|: total variation distance
  • f(t) = (sqrt(t) - 1)^2: squared Hellinger distance
  • f(t) = (t - 1)^2: Pearson chi-squared divergence

All f-divergences satisfy:

  • Non-negativity: D_f(P || Q) >= 0, with equality iff P = Q (when f is strictly convex at t = 1).
  • Data processing inequality: D_f(PK || QK) <= D_f(P || Q) for any Markov kernel K.
  • Joint convexity: (P, Q) -> D_f(P || Q) is jointly convex.

For the discrete case, the formula is D_f(P || Q) = sum_i q_i f(p_i/q_i). For the continuous case, densities are estimated via KDE and the integral is computed using the trapezoidal rule.
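
For example, the generators listed above can be passed directly as the f argument; a short sketch using only f_divergence as documented here (the lambda generators, seed, and sample sizes are illustrative, and the samples are drawn so every category appears in both, since f(t) = t log(t) is undefined at t = 0):

>>> import numpy as np
>>> from divergence import f_divergence
>>> rng = np.random.default_rng(0)
>>> p = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.3, 0.5])
>>> q = rng.choice([0, 1, 2], size=1000, p=[0.3, 0.3, 0.4])
>>> # KL divergence: generator f(t) = t * log(t)
>>> kl = f_divergence(p, q, f=lambda t: t * np.log(t), discrete=True)
>>> # total variation: generator f(t) = 0.5 * |t - 1|
>>> tv = f_divergence(p, q, f=lambda t: 0.5 * np.abs(t - 1), discrete=True)
>>> # squared Hellinger: generator f(t) = (sqrt(t) - 1)^2
>>> h2 = f_divergence(p, q, f=lambda t: (np.sqrt(t) - 1) ** 2, discrete=True)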

Examples:

>>> import numpy as np
>>> from divergence import f_divergence
>>> rng = np.random.default_rng(42)
>>> p = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.3, 0.5])
>>> q = rng.choice([0, 1, 2], size=1000, p=[0.3, 0.3, 0.4])
>>> f_divergence(p, q, f=lambda t: (t - 1) ** 2, discrete=True)  # chi-squared
0.07...
References

.. [1] Csiszar, I. (1967). "Information-type measures of difference of probability distributions." Studia Sci. Math. Hungar., 2, 299-318.

total_variation_distance(sample_p, sample_q, *, discrete=False)

Total variation distance between P and Q.

TV(P, Q) = 0.5 * integral |p(x) - q(x)| dx

Parameters:

  • sample_p (ndarray, required): Sample from distribution P.
  • sample_q (ndarray, required): Sample from distribution Q.
  • discrete (bool, default False): If True, treat samples as discrete categories.

Returns:

  • float: Total variation distance, in [0, 1].

Notes

Total variation is the largest possible difference in probabilities that P and Q assign to the same event:

TV(P, Q) = sup_A |P(A) - Q(A)|

It is equivalent to the f-divergence with f(t) = 0.5 |t - 1|.

Properties:

  • Symmetric: TV(P, Q) = TV(Q, P)
  • Bounded: 0 <= TV <= 1
  • Metric: satisfies the triangle inequality
  • Pinsker's inequality: TV(P, Q) <= sqrt(0.5 * D_KL(P || Q))
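
A rough numerical check of the Pinsker bound above, using f_divergence from this page for the KL term (a sketch; both quantities are finite-sample estimates, so the comparison is only indicative):

>>> import numpy as np
>>> from divergence import f_divergence, total_variation_distance
>>> rng = np.random.default_rng(1)
>>> p = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.3, 0.5])
>>> q = rng.choice([0, 1, 2], size=1000, p=[0.3, 0.3, 0.4])
>>> tv = total_variation_distance(p, q, discrete=True)
>>> kl = f_divergence(p, q, f=lambda t: t * np.log(t), discrete=True)  # D_KL(P || Q)
>>> bound = np.sqrt(0.5 * kl)  # Pinsker: tv should not exceed this, up to estimation error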

Examples:

>>> import numpy as np
>>> from divergence import total_variation_distance
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> total_variation_distance(p, q, discrete=True)
0.16...
References

.. [1] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer. Section 2.4.

squared_hellinger_distance(sample_p, sample_q, *, discrete=False)

Squared Hellinger distance between P and Q.

H^2(P, Q) = sum_i (sqrt(p_i) - sqrt(q_i))^2                  [discrete]
H^2(P, Q) = integral (sqrt(p(x)) - sqrt(q(x)))^2 dx          [continuous]

Parameters:

  • sample_p (ndarray, required): Sample from distribution P.
  • sample_q (ndarray, required): Sample from distribution Q.
  • discrete (bool, default False): If True, treat samples as discrete categories.

Returns:

  • float: Squared Hellinger distance, in [0, 2].

Notes

The Hellinger distance H(P, Q) = sqrt(H^2(P, Q)) is a proper metric satisfying the triangle inequality. The squared version is returned here because it arises naturally in the f-divergence framework with f(t) = (sqrt(t) - 1)^2.

Properties:

  • Symmetric: H^2(P, Q) = H^2(Q, P)
  • Bounded: 0 <= H^2 <= 2
  • Relation to TV: H^2/2 <= TV <= H * sqrt(1 - H^2/4) (in particular, TV <= H)
  • Relation to Bhattacharyya: H^2 = 2(1 - BC(P, Q)) where BC is the Bhattacharyya coefficient.

For normal distributions P = N(mu_1, sigma_1^2) and Q = N(mu_2, sigma_2^2):

H^2 = 2 * (1 - sqrt(2*sigma_1*sigma_2 / (sigma_1^2 + sigma_2^2))
       * exp(-(mu_1 - mu_2)^2 / (4*(sigma_1^2 + sigma_2^2))))
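
The closed form provides a sanity check for the KDE-based estimate; a sketch with illustrative parameter values and seed (agreement is only approximate and depends on sample size and kernel bandwidth):

>>> import numpy as np
>>> from divergence import squared_hellinger_distance
>>> rng = np.random.default_rng(0)
>>> mu1, sigma1, mu2, sigma2 = 0.0, 1.0, 0.5, 1.5
>>> xs = rng.normal(mu1, sigma1, size=5000)
>>> ys = rng.normal(mu2, sigma2, size=5000)
>>> est = squared_hellinger_distance(xs, ys)  # continuous case: KDE + trapezoidal rule
>>> overlap = np.sqrt(2 * sigma1 * sigma2 / (sigma1**2 + sigma2**2))
>>> shift = np.exp(-(mu1 - mu2) ** 2 / (4 * (sigma1**2 + sigma2**2)))
>>> closed = 2 * (1 - overlap * shift)  # closed form above; overlap * shift is the Bhattacharyya coefficient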

Examples:

>>> import numpy as np
>>> from divergence import squared_hellinger_distance
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> squared_hellinger_distance(p, q, discrete=True)
0.04...
References

.. [1] Hellinger, E. (1909). "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen." J. Reine Angew. Math., 136, 210-271.

chi_squared_divergence(sample_p, sample_q, *, discrete=False)

Pearson chi-squared divergence of P from Q.

chi^2(P || Q) = sum_i (p_i - q_i)^2 / q_i                    [discrete]
chi^2(P || Q) = integral (p(x) - q(x))^2 / q(x) dx           [continuous]

Parameters:

  • sample_p (ndarray, required): Sample from distribution P.
  • sample_q (ndarray, required): Sample from distribution Q.
  • discrete (bool, default False): If True, treat samples as discrete categories.

Returns:

  • float: Chi-squared divergence, in [0, +inf).

Notes

This is the f-divergence with f(t) = (t - 1)^2. It is related to the classical Pearson chi-squared goodness-of-fit statistic.

Properties:

  • Not symmetric: chi^2(P || Q) != chi^2(Q || P) in general
  • Non-negative: chi^2(P || Q) >= 0
  • Upper bound on KL: D_KL(P || Q) <= log(1 + chi^2(P || Q))

For normal distributions P = N(mu_1, sigma_1^2), Q = N(mu_2, sigma_2^2), when sigma_1^2 < 2*sigma_2^2:

chi^2(P || Q) = sigma_2^2 / sqrt(sigma_1^2 * (2*sigma_2^2 - sigma_1^2))
                * exp((mu_1 - mu_2)^2 / (2*sigma_2^2 - sigma_1^2)) - 1
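
As with the Hellinger case, the closed form can be compared against the KDE-based estimate; a sketch with illustrative parameters chosen so that sigma_1^2 < 2*sigma_2^2 (only rough agreement is expected for finite samples):

>>> import numpy as np
>>> from divergence import chi_squared_divergence
>>> rng = np.random.default_rng(0)
>>> mu1, sigma1, mu2, sigma2 = 0.0, 1.0, 0.3, 1.2
>>> xs = rng.normal(mu1, sigma1, size=5000)
>>> ys = rng.normal(mu2, sigma2, size=5000)
>>> est = chi_squared_divergence(xs, ys)  # continuous case: KDE + trapezoidal rule
>>> prefactor = sigma2**2 / np.sqrt(sigma1**2 * (2 * sigma2**2 - sigma1**2))
>>> closed = prefactor * np.exp((mu1 - mu2) ** 2 / (2 * sigma2**2 - sigma1**2)) - 1  # closed form above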

Examples:

>>> import numpy as np
>>> from divergence import chi_squared_divergence
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> chi_squared_divergence(p, q, discrete=True)
0.1...
References

.. [1] Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Phil. Mag., 50(302), 157-175.

jeffreys_divergence(sample_p, sample_q, *, discrete=False, base=np.e)

Jeffreys divergence (symmetrized KL divergence).

D_J(P, Q) = D_KL(P || Q) + D_KL(Q || P) = sum_i (p_i - q_i) log(p_i / q_i)

Parameters:

  • sample_p (ndarray, required): Sample from distribution P.
  • sample_q (ndarray, required): Sample from distribution Q.
  • discrete (bool, default False): If True, treat samples as discrete categories.
  • base (float, default e): Base of the logarithm; e gives nats, 2 gives bits.

Returns:

  • float: Jeffreys divergence, in [0, +inf).

Notes

Jeffreys divergence is the f-divergence with f(t) = (t - 1) log(t). Unlike KL divergence, it is symmetric.

Properties:

  • Symmetric: D_J(P, Q) = D_J(Q, P)
  • Non-negative: D_J(P, Q) >= 0
  • Equals sum of KL divergences: D_J = D_KL(P || Q) + D_KL(Q || P)

For normal distributions P = N(mu_1, sigma_1^2), Q = N(mu_2, sigma_2^2):

D_J = ((sigma_1^2 - sigma_2^2)^2 + (sigma_1^2 + sigma_2^2)(mu_1 - mu_2)^2)
      / (2 * sigma_1^2 * sigma_2^2)
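
A sketch comparing the KDE-based estimate to this closed form, with illustrative parameter values and seed (the default base e matches the natural-log closed form; the estimate is only expected to be close for large samples):

>>> import numpy as np
>>> from divergence import jeffreys_divergence
>>> rng = np.random.default_rng(0)
>>> mu1, sigma1, mu2, sigma2 = 0.0, 1.0, 0.5, 1.5
>>> xs = rng.normal(mu1, sigma1, size=5000)
>>> ys = rng.normal(mu2, sigma2, size=5000)
>>> est = jeffreys_divergence(xs, ys)  # continuous case, in nats
>>> num = (sigma1**2 - sigma2**2) ** 2 + (sigma1**2 + sigma2**2) * (mu1 - mu2) ** 2
>>> closed = num / (2 * sigma1**2 * sigma2**2)  # closed form above, in nats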

Examples:

>>> import numpy as np
>>> from divergence import jeffreys_divergence
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> jeffreys_divergence(p, q, discrete=True)
0.3...
References

.. [1] Jeffreys, H. (1946). "An invariant form for the prior probability in estimation problems." Proc. Royal Soc. A, 186(1007), 453-461.

cressie_read_divergence(sample_p, sample_q, *, lambda_param=2 / 3, discrete=False)

Cressie-Read power divergence family.

CR_lambda(P || Q) = (1 / (lambda * (lambda + 1))) * sum_i q_i * [(p_i / q_i)^(lambda + 1) - 1]

Parameters:

  • sample_p (ndarray, required): Sample from distribution P.
  • sample_q (ndarray, required): Sample from distribution Q.
  • lambda_param (float, default 2/3): Power parameter; 2/3 is the value recommended by Cressie and Read. Special cases:
      • lambda -> -1: reverse KL divergence D_KL(Q || P)
      • lambda -> 0: KL divergence D_KL(P || Q) (log-likelihood ratio)
      • lambda = -0.5: scaled squared Hellinger distance (2 * H^2)
      • lambda = 1: half the Pearson chi-squared divergence, chi^2(P || Q) / 2 (the Neyman chi-squared chi^2(Q || P) / 2 corresponds to lambda = -2)
  • discrete (bool, default False): If True, treat samples as discrete categories.

Returns:

  • float: Cressie-Read divergence, in [0, +inf).

Raises:

  • ValueError: If discrete=True and P has positive mass where Q has zero mass.

Notes

The Cressie-Read family unifies many important divergences via a single lambda parameter. The generator function is:

f_lambda(t) = (t^(lambda+1) - 1 - (lambda+1)(t - 1)) / (lambda*(lambda+1))

As lambda -> 0, the divergence converges to the KL divergence. As lambda -> -1, it converges to the reverse KL divergence.
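
A small numerical illustration of the lambda -> 0 limit, using f_divergence from this page for the KL reference (a sketch; the seed and lambda value are illustrative, and both quantities are plug-in estimates that should agree only approximately):

>>> import numpy as np
>>> from divergence import cressie_read_divergence, f_divergence
>>> rng = np.random.default_rng(0)
>>> p = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.3, 0.5])
>>> q = rng.choice([0, 1, 2], size=1000, p=[0.3, 0.3, 0.4])
>>> cr_small = cressie_read_divergence(p, q, lambda_param=1e-3, discrete=True)
>>> kl = f_divergence(p, q, f=lambda t: t * np.log(t), discrete=True)
>>> # cr_small should be close to kl, since CR_lambda -> D_KL(P || Q) as lambda -> 0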

Examples:

>>> import numpy as np
>>> from divergence import cressie_read_divergence
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> cressie_read_divergence(p, q, lambda_param=1.0, discrete=True)  # Pearson chi^2 / 2
0.1...
References

.. [1] Cressie, N. & Read, T. R. C. (1984). "Multinomial goodness-of-fit tests." JRSS B, 46(3), 440-464.
.. [2] Read, T. R. C. & Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer.