f-Divergences
General f-divergence framework and named convenience functions for the most common special cases.
f_divergence(sample_p, sample_q, f, *, discrete=False)
Compute a general f-divergence D_f(P || Q).
The f-divergence of P from Q is defined as
D_f(P || Q) = E_Q[f(dP/dQ)] = integral q(x) f(p(x)/q(x)) dx
where f is a convex function with f(1) = 0.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample_p | ndarray | Sample from distribution P. | required |
| sample_q | ndarray | Sample from distribution Q. | required |
| f | callable | Convex generator function with f(1) = 0. Must accept and return np.ndarray (vectorized). | required |
| discrete | bool | If True, treat samples as discrete categories. Otherwise, estimate densities via kernel density estimation. | False |

Returns:

| Type | Description |
|---|---|
| float | The estimated f-divergence D_f(P \|\| Q). |

Raises:

| Type | Description |
|---|---|
| ValueError | If … |
Notes
Different choices of f yield well-known divergences:
- f(t) = t log(t): KL divergence
- f(t) = 0.5 |t - 1|: total variation distance
- f(t) = (sqrt(t) - 1)^2: squared Hellinger distance
- f(t) = (t - 1)^2: Pearson chi-squared divergence
All f-divergences satisfy:
- Non-negativity: D_f(P || Q) >= 0, with equality iff P = Q (for strictly convex f at 1).
- Data processing inequality: D_f(PK || QK) <= D_f(P || Q) for any Markov kernel K.
- Joint convexity: (P, Q) -> D_f(P || Q) is jointly convex.
For the discrete case, the formula is D_f(P || Q) = sum_i q_i f(p_i/q_i). For the continuous case, densities are estimated via KDE and the integral is computed using the trapezoidal rule.
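As a rough illustration of the discrete plug-in formula above, the sketch below estimates D_f from two samples via empirical frequencies. Here discrete_f_divergence is a hypothetical helper written only for exposition (not a library function), and it assumes every observed category also appears in sample_q, so all q_i > 0.
>>> import numpy as np
>>> def discrete_f_divergence(sample_p, sample_q, f):
...     # Hypothetical plug-in estimator (illustration only): empirical
...     # frequencies over the pooled support; assumes all q_i > 0.
...     cats = np.union1d(sample_p, sample_q)
...     p = np.array([np.mean(sample_p == c) for c in cats])
...     q = np.array([np.mean(sample_q == c) for c in cats])
...     return float(np.sum(q * f(p / q)))
>>> x = np.array([0, 0, 1, 2])
>>> y = np.array([0, 1, 1, 2])
>>> discrete_f_divergence(x, y, lambda t: (t - 1) ** 2)  # Pearson chi-squared generator
0.375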
Examples:
>>> import numpy as np
>>> from divergence import f_divergence
>>> rng = np.random.default_rng(42)
>>> p = rng.choice([0, 1, 2], size=1000, p=[0.2, 0.3, 0.5])
>>> q = rng.choice([0, 1, 2], size=1000, p=[0.3, 0.3, 0.4])
>>> f_divergence(p, q, f=lambda t: (t - 1) ** 2, discrete=True) # chi-squared
0.07...
References
.. [1] Csiszar, I. (1967). "Information-type measures of difference of probability distributions." Studia Sci. Math. Hungar., 2, 299-318.
total_variation_distance(sample_p, sample_q, *, discrete=False)
Total variation distance between P and Q.
TV(P, Q) = 0.5 * integral |p(x) - q(x)| dx
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample_p | ndarray | Sample from distribution P. | required |
| sample_q | ndarray | Sample from distribution Q. | required |
| discrete | bool | If True, treat samples as discrete categories. | False |

Returns:

| Type | Description |
|---|---|
| float | Total variation distance, in [0, 1]. |
Notes
Total variation is the largest possible difference in probabilities that P and Q assign to the same event:
TV(P, Q) = sup_A |P(A) - Q(A)|
It is equivalent to the f-divergence with f(t) = 0.5 |t - 1|.
Properties:
- Symmetric: TV(P, Q) = TV(Q, P)
- Bounded: 0 <= TV <= 1
- Metric: satisfies the triangle inequality
- Pinsker's inequality: TV(P, Q) <= sqrt(0.5 * D_KL(P || Q))
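Because total variation is the f-divergence with f(t) = 0.5 |t - 1| (see above), the convenience function should agree with the general entry point. A hedged sanity check, assuming both use the same discrete plug-in estimator:
>>> import numpy as np
>>> from divergence import f_divergence, total_variation_distance
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> tv_generic = f_divergence(p, q, f=lambda t: 0.5 * np.abs(t - 1), discrete=True)
>>> assert np.isclose(tv_generic, total_variation_distance(p, q, discrete=True))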
Examples:
>>> import numpy as np
>>> from divergence import total_variation_distance
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> total_variation_distance(p, q, discrete=True)
0.16...
References
.. [1] Tsybakov, A. B. (2009). Introduction to Nonparametric Estimation. Springer. Section 2.4.
squared_hellinger_distance(sample_p, sample_q, *, discrete=False)
Squared Hellinger distance between P and Q.
H^2(P, Q) = sum_i (sqrt(p_i) - sqrt(q_i))^2                  [discrete]
H^2(P, Q) = integral (sqrt(p(x)) - sqrt(q(x)))^2 dx          [continuous]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample_p | ndarray | Sample from distribution P. | required |
| sample_q | ndarray | Sample from distribution Q. | required |
| discrete | bool | If True, treat samples as discrete categories. | False |

Returns:

| Type | Description |
|---|---|
| float | Squared Hellinger distance, in [0, 2]. |
Notes
The Hellinger distance H(P, Q) = sqrt(H^2(P, Q)) is a proper metric satisfying the triangle inequality. The squared version is returned here because it arises naturally in the f-divergence framework with f(t) = (sqrt(t) - 1)^2.
Properties:
- Symmetric: H^2(P, Q) = H^2(Q, P)
- Bounded: 0 <= H^2 <= 2
- Relation to TV: H^2/2 <= TV <= H
- Relation to Bhattacharyya: H^2 = 2(1 - BC(P, Q)) where BC is the Bhattacharyya coefficient.
For normal distributions P = N(mu_1, sigma_1^2) and Q = N(mu_2, sigma_2^2):
H^2 = 2 * (1 - sqrt(2*sigma_1*sigma_2 / (sigma_1^2 + sigma_2^2))
* exp(-(mu_1 - mu_2)^2 / (4*(sigma_1^2 + sigma_2^2))))
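For concreteness, plugging P = N(0, 1) and Q = N(1, 1) into this closed form with plain NumPy (no library calls) gives:
>>> import numpy as np
>>> mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 1.0  # P = N(0, 1), Q = N(1, 1)
>>> bc = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) * np.exp(-(mu1 - mu2)**2 / (4 * (s1**2 + s2**2)))
>>> round(float(2 * (1 - bc)), 3)  # squared Hellinger distance
0.235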
Examples:
>>> import numpy as np
>>> from divergence import squared_hellinger_distance
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> squared_hellinger_distance(p, q, discrete=True)
0.04...
References
.. [1] Hellinger, E. (1909). "Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen." J. Reine Angew. Math., 136, 210-271.
chi_squared_divergence(sample_p, sample_q, *, discrete=False)
Pearson chi-squared divergence of P from Q.
chi^2(P || Q) = sum_i (p_i - q_i)^2 / q_i                    [discrete]
chi^2(P || Q) = integral (p(x) - q(x))^2 / q(x) dx           [continuous]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample_p | ndarray | Sample from distribution P. | required |
| sample_q | ndarray | Sample from distribution Q. | required |
| discrete | bool | If True, treat samples as discrete categories. | False |

Returns:

| Type | Description |
|---|---|
| float | Chi-squared divergence, in [0, +inf). |
Notes
This is the f-divergence with f(t) = (t - 1)^2. It is related to the classical Pearson chi-squared goodness-of-fit statistic.
Properties:
- Not symmetric: chi^2(P || Q) != chi^2(Q || P) in general
- Non-negative: chi^2(P || Q) >= 0
- Upper bound on KL: D_KL(P || Q) <= log(1 + chi^2(P || Q))
For normal distributions P = N(mu_1, sigma_1^2), Q = N(mu_2, sigma_2^2), when sigma_1^2 < 2*sigma_2^2:
chi^2(P || Q) = sigma_2^2 / (sigma_1 * sqrt(2*sigma_2^2 - sigma_1^2))
* exp((mu_1 - mu_2)^2 / (2*sigma_2^2 - sigma_1^2)) - 1
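As a quick numeric sketch of this closed form with plain NumPy, for P = N(0, 1) and Q = N(1, 1):
>>> import numpy as np
>>> mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 1.0  # P = N(0, 1), Q = N(1, 1)
>>> prefactor = s2**2 / (s1 * np.sqrt(2 * s2**2 - s1**2))
>>> round(float(prefactor * np.exp((mu1 - mu2)**2 / (2 * s2**2 - s1**2)) - 1), 3)  # = e - 1
1.718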
Examples:
>>> import numpy as np
>>> from divergence import chi_squared_divergence
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> chi_squared_divergence(p, q, discrete=True)
0.1...
References
.. [1] Pearson, K. (1900). "On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling." Phil. Mag., 50(302), 157-175.
jeffreys_divergence(sample_p, sample_q, *, discrete=False, base=np.e)
Jeffreys divergence (symmetrized KL divergence).
D_J(P, Q) = D_KL(P || Q) + D_KL(Q || P) = sum_i (p_i - q_i) log(p_i / q_i)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample_p | ndarray | Sample from distribution P. | required |
| sample_q | ndarray | Sample from distribution Q. | required |
| discrete | bool | If True, treat samples as discrete categories. | False |
| base | float | Base of the logarithm (default: e for nats, 2 for bits). | np.e |

Returns:

| Type | Description |
|---|---|
| float | Jeffreys divergence, in [0, +inf). |
Notes
Jeffreys divergence is the f-divergence with f(t) = (t - 1) log(t). Unlike KL divergence, it is symmetric.
Properties:
- Symmetric: D_J(P, Q) = D_J(Q, P)
- Non-negative: D_J(P, Q) >= 0
- Equals sum of KL divergences: D_J = D_KL(P || Q) + D_KL(Q || P)
For normal distributions P = N(mu_1, sigma_1^2), Q = N(mu_2, sigma_2^2):
D_J = ((sigma_1^2 - sigma_2^2)^2 + (sigma_1^2 + sigma_2^2)(mu_1 - mu_2)^2)
/ (2 * sigma_1^2 * sigma_2^2)
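For example, evaluating this closed form for P = N(0, 1) and Q = N(1, 4) with plain Python (v1 and v2 denote the variances sigma_1^2 and sigma_2^2):
>>> mu1, v1, mu2, v2 = 0.0, 1.0, 1.0, 4.0  # P = N(0, 1), Q = N(1, 4)
>>> ((v1 - v2)**2 + (v1 + v2) * (mu1 - mu2)**2) / (2 * v1 * v2)
1.75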
Examples:
>>> import numpy as np
>>> from divergence import jeffreys_divergence
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> jeffreys_divergence(p, q, discrete=True)
0.3...
References
.. [1] Jeffreys, H. (1946). "An invariant form for the prior probability in estimation problems." Proc. Royal Soc. A, 186(1007), 453-461.
cressie_read_divergence(sample_p, sample_q, *, lambda_param=2 / 3, discrete=False)
Cressie-Read power divergence family.
CR_lambda(P || Q) = (1 / (lambda * (lambda + 1))) * sum_i q_i * [(p_i / q_i)^(lambda + 1) - 1]
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sample_p | ndarray | Sample from distribution P. | required |
| sample_q | ndarray | Sample from distribution Q. | required |
| lambda_param | float | Power parameter (default: 2/3, the Cressie-Read recommended value). Special cases: lambda_param=1 gives half the Pearson chi-squared divergence, lambda_param -> 0 recovers KL, and lambda_param -> -1 recovers reverse KL. | 2 / 3 |
| discrete | bool | If True, treat samples as discrete categories. | False |

Returns:

| Type | Description |
|---|---|
| float | Cressie-Read divergence, in [0, +inf). |

Raises:

| Type | Description |
|---|---|
| ValueError | If … |
Notes
The Cressie-Read family unifies many important divergences via a single lambda parameter. The generator function is:
f_lambda(t) = (t^(lambda+1) - 1 - (lambda+1)(t - 1)) / (lambda*(lambda+1))
As lambda -> 0, the divergence converges to the KL divergence. As lambda -> -1, it converges to the reverse KL divergence.
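To see these limits concretely, here is a small sketch of the generator f_lambda; f_cr is written only for illustration and is not a library function:
>>> def f_cr(t, lam):
...     # Cressie-Read generator from the Notes above
...     return (t ** (lam + 1) - 1 - (lam + 1) * (t - 1)) / (lam * (lam + 1))
>>> round(f_cr(2.0, 1e-6), 6)  # approaches t*log(t) - (t - 1) = 2*log(2) - 1 as lambda -> 0
0.386294
>>> f_cr(3.0, 1.0) == (3.0 - 1) ** 2 / 2  # lambda = 1 recovers the Pearson generator, halved
True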
Examples:
>>> import numpy as np
>>> from divergence import cressie_read_divergence
>>> p = np.array([0, 0, 0, 1, 1, 2])
>>> q = np.array([0, 1, 1, 1, 2, 2])
>>> cressie_read_divergence(p, q, lambda_param=1.0, discrete=True) # Pearson chi^2 / 2
0.1...
References
.. [1] Cressie, N. & Read, T. R. C. (1984). "Multinomial goodness-of-fit tests." JRSS B, 46(3), 440-464.
.. [2] Read, T. R. C. & Cressie, N. (1988). Goodness-of-Fit Statistics for Discrete Multivariate Data. Springer.