Divergence: The Dissolution of Uncertainty — One Bit at a Time¶
In 1948, Claude Shannon published A Mathematical Theory of Communication while working at Bell Labs, laying the foundation for information theory: a framework that quantifies uncertainty, surprise, and the fundamental limits of communication. Legend has it that when Shannon was unsure what to call his measure of uncertainty, John von Neumann advised him: "Call it entropy. Nobody really knows what entropy is, so in a debate you will always have the advantage."
This notebook is a self-contained introduction to the core measures of information theory, using the Divergence Python package to compute each one from data. We progress from the foundational concept of entropy through cross entropy, KL divergence, and Jensen-Shannon divergence, to the multivariate measures (mutual information, joint entropy, and conditional entropy), showing how they all connect in a beautiful web of relationships.
Who this is for: Anyone curious about information theory: data scientists, machine learning practitioners, physicists, or students encountering these ideas for the first time. We assume basic probability but nothing more.
import numpy as np
import scipy as sp
import scipy.stats  # make sp.stats available on SciPy versions that do not lazy-load submodules
import matplotlib.pyplot as plt
from divergence import (
conditional_entropy_from_samples,
continuous_conditional_entropy_from_samples,
continuous_cross_entropy_from_sample,
continuous_entropy_from_sample,
continuous_jensen_shannon_divergence_from_sample,
continuous_joint_entropy_from_samples,
continuous_mutual_information_from_samples,
continuous_relative_entropy_from_sample,
cross_entropy_from_samples,
discrete_conditional_entropy_of_y_given_x,
discrete_cross_entropy,
discrete_entropy,
discrete_jensen_shannon_divergence,
discrete_joint_entropy,
discrete_mutual_information,
discrete_relative_entropy,
entropy_from_samples,
jensen_shannon_divergence_from_samples,
joint_entropy_from_samples,
mutual_information_from_samples,
relative_entropy_from_samples,
)
plt.rcParams.update({
'figure.figsize': (8, 4),
'axes.spines.top': False,
'axes.spines.right': False,
'font.size': 12,
})
from pathlib import Path
FIGURES_DIR = Path('figures/divergence')
FIGURES_DIR.mkdir(parents=True, exist_ok=True)
1. Setting the Stage: Our Distributions¶
Throughout this notebook we work with two normal distributions:
$$p = \mathcal{N}(\mu_p = 2,\; \sigma_p = 3), \qquad q = \mathcal{N}(\mu_q = 1,\; \sigma_q = 2).$$
We draw large samples from each and let Divergence estimate information-theoretic quantities via kernel density estimation (KDE). For every measure we also compute the analytical (exact) value so you can see how close the estimates are.
np.random.seed(42)
# Parameters
mu_p, sigma_p = 2, 3
mu_q, sigma_q = 1, 2
n = 10_000
# Antithetic sampling for variance reduction
z_p = np.random.randn(n)
sample_p = np.concatenate([mu_p + sigma_p * z_p, mu_p - sigma_p * z_p])
z_q = np.random.randn(n)
sample_q = np.concatenate([mu_q + sigma_q * z_q, mu_q - sigma_q * z_q])
# Exact densities for analytical comparisons
pdf_p = lambda x: sp.stats.norm.pdf(x, mu_p, sigma_p)
pdf_q = lambda x: sp.stats.norm.pdf(x, mu_q, sigma_q)
x = np.linspace(-12, 16, 500)
fig, ax = plt.subplots()
ax.plot(x, pdf_p(x), label=r'$p = \mathcal{N}(2, 9)$', color='steelblue', linewidth=2)
ax.plot(x, pdf_q(x), label=r'$q = \mathcal{N}(1, 4)$', color='coral', linewidth=2)
ax.fill_between(x, pdf_p(x), alpha=0.15, color='steelblue')
ax.fill_between(x, pdf_q(x), alpha=0.15, color='coral')
ax.set_xlabel('x')
ax.set_ylabel('Density')
ax.set_title('The two distributions we will compare throughout')
ax.legend()
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'two_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
We also define small discrete samples to illustrate the discrete variants of each measure.
# Discrete samples — think of these as observed category labels
discrete_p = np.array([1, 2, 3, 3, 3, 3, 3, 3, 3, 3]) # heavily concentrated on 3
discrete_q = np.array([1, 2, 3, 2, 3, 3, 3, 2, 1, 1]) # more spread out
2. Entropy — Measuring Uncertainty¶
Shannon borrowed the name entropy from thermodynamics, where Ludwig Boltzmann had used it to count microstates. In information theory, entropy answers a deceptively simple question: how surprised should you be, on average, by an outcome drawn from a distribution?
Why it matters: Entropy is the theoretical lower bound on the average number of bits (or nats) needed to encode a message from a source. It is the bedrock on which all other information-theoretic measures are built.
$$H(X) = -\mathbb{E}_p\!\left[\ln p(X)\right] = -\int p(x) \ln p(x)\, dx$$
For a discrete distribution with $k$ outcomes, entropy is maximized by the uniform distribution at $H = \ln k$ (or $\log_2 k$ bits). Greater concentration means lower entropy — less uncertainty.
Units: The base of the logarithm determines the unit. Base $e$ gives nats, base 2 gives bits (the classic Shannon unit), and base 10 gives hartleys. Divergence defaults to nats.
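To make the unit conversion concrete, here is a minimal plain-NumPy sketch (independent of the Divergence package, using a hypothetical pmf): bits are just nats divided by $\ln 2$.
# Entropy of an example pmf in nats and in bits
pmf = np.array([0.5, 0.25, 0.25])
H_nats = -np.sum(pmf * np.log(pmf))   # ≈ 1.0397 nats
H_bits = -np.sum(pmf * np.log2(pmf))  # exactly 1.5 bits
print(H_nats, H_bits, H_nats / np.log(2))  # the last two agree: bits = nats / ln 2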
Continuous entropy¶
# Analytical entropy of a normal distribution: H = 0.5 * (1 + ln(2π σ²))
def normal_entropy(sigma):
return 0.5 * (1.0 + np.log(2 * np.pi * sigma**2))
H_p_est = entropy_from_samples(sample_p)
H_q_est = entropy_from_samples(sample_q)
H_p_exact = normal_entropy(sigma_p)
H_q_exact = normal_entropy(sigma_q)
print(f'H(p) estimated = {H_p_est:.4f}, analytical = {H_p_exact:.4f}')
print(f'H(q) estimated = {H_q_est:.4f}, analytical = {H_q_exact:.4f}')
H(p) estimated = 2.5311, analytical = 2.5176
H(q) estimated = 2.1233, analytical = 2.1121
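Because we know the exact density, we can also sanity-check the expectation form $H(X) = -\mathbb{E}_p[\ln p(X)]$ by plain Monte Carlo, averaging the exact log-density over the sample (a sketch using pdf_p and sample_p from above, independent of the KDE estimator):
# Monte Carlo estimate of H(p) = -E_p[ln p(X)] from the exact density
H_p_mc = -np.mean(np.log(pdf_p(sample_p)))
print(f'H(p) via Monte Carlo = {H_p_mc:.4f} (analytical: {H_p_exact:.4f})')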
Distribution $p$ has higher variance ($\sigma_p = 3$ vs $\sigma_q = 2$), so it has higher entropy — more uncertainty, more "spread".
Discrete entropy¶
H_dp = discrete_entropy(discrete_p)
H_dq = discrete_entropy(discrete_q)
H_max = np.log(3) # uniform over 3 categories
print(f'H(p_discrete) = {H_dp:.4f} (concentrated — low entropy)')
print(f'H(q_discrete) = {H_dq:.4f} (spread out — higher entropy)')
print(f'H(uniform) = {H_max:.4f} (maximum for k=3 categories)')
H(p_discrete) = 0.6390 (concentrated — low entropy)
H(q_discrete) = 1.0889 (spread out — higher entropy)
H(uniform) = 1.0986 (maximum for k=3 categories)
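A frequency-based estimator like this amounts to computing the entropy of the empirical distribution. Here is a minimal reimplementation for intuition (a sketch; the library's actual internals may differ):
def empirical_entropy(sample):
    # Entropy (in nats) of the empirical distribution of a discrete sample
    _, counts = np.unique(sample, return_counts=True)
    probs = counts / counts.sum()
    return -np.sum(probs * np.log(probs))
print(f'empirical_entropy(discrete_p) = {empirical_entropy(discrete_p):.4f}')  # should match H(p_discrete) above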
3. Cross Entropy — The Cost of Being Wrong¶
If entropy measures the optimal encoding cost, cross entropy measures what happens when you use the wrong code. Suppose data comes from distribution $p$, but you design your encoding assuming distribution $q$. The average cost is the cross entropy:
$$H_q(p) = -\mathbb{E}_p\!\left[\ln q(X)\right] = -\int p(x) \ln q(x)\, dx$$
Why it matters: Cross entropy is the loss function behind logistic regression and most neural network classifiers. Minimizing cross entropy is equivalent to maximizing the likelihood of the true labels under the model's predicted distribution.
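To make that connection concrete, here is a minimal sketch with hypothetical labels and predicted probabilities: the binary cross-entropy loss is exactly the mean negative log-likelihood of the labels under the model.
# Binary cross entropy as a classification loss (hypothetical data)
y_true = np.array([1, 0, 1, 1])          # observed labels
y_prob = np.array([0.9, 0.2, 0.7, 0.6])  # model's predicted P(y = 1)
bce = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
print(f'binary cross entropy = {bce:.4f}')  # the mean negative log-likelihood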
Gibbs' inequality guarantees $H_q(p) \geq H(p)$: using the wrong codebook never helps. Equality holds only when $q = p$.
Continuous cross entropy¶
# Analytical cross entropy between normals: H_q(p) = 0.5*ln(2πσ_q²) + (σ_p² + (μ_p-μ_q)²)/(2σ_q²)
def normal_cross_entropy(mu_p, sigma_p, mu_q, sigma_q):
return 0.5 * np.log(2 * np.pi * sigma_q**2) + (sigma_p**2 + (mu_p - mu_q)**2) / (2 * sigma_q**2)
CE_pq_est = cross_entropy_from_samples(sample_p, sample_q)
CE_qp_est = cross_entropy_from_samples(sample_q, sample_p)
CE_pq_exact = normal_cross_entropy(mu_p, sigma_p, mu_q, sigma_q)
CE_qp_exact = normal_cross_entropy(mu_q, sigma_q, mu_p, sigma_p)
print(f'H_q(p) estimated = {CE_pq_est:.4f}, analytical = {CE_pq_exact:.4f}')
print(f'H_p(q) estimated = {CE_qp_est:.4f}, analytical = {CE_qp_exact:.4f}')
print(f'\nGibbs\' inequality check: H_q(p) = {CE_pq_est:.4f} >= H(p) = {H_p_est:.4f}? {CE_pq_est >= H_p_est - 1e-6}')
H_q(p) estimated = 2.8301, analytical = 2.8621
H_p(q) estimated = 2.3061, analytical = 2.2953

Gibbs' inequality check: H_q(p) = 2.8301 >= H(p) = 2.5311? True
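The expectation form again lends itself to a direct Monte Carlo sanity check: sample from $p$, but score each point under $q$'s exact density (a sketch using the densities defined earlier):
# Monte Carlo estimate of H_q(p) = -E_p[ln q(X)]
CE_pq_mc = -np.mean(np.log(pdf_q(sample_p)))
print(f'H_q(p) via Monte Carlo = {CE_pq_mc:.4f} (analytical: {CE_pq_exact:.4f})')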
Discrete cross entropy¶
CE_dp_dq = discrete_cross_entropy(discrete_p, discrete_q)
CE_dq_dp = discrete_cross_entropy(discrete_q, discrete_p)
print(f'H_q(p) = {CE_dp_dq:.4f} (encoding p with q\'s codebook)')
print(f'H_p(q) = {CE_dq_dp:.4f} (encoding q with p\'s codebook)')
H_q(p) = 0.9738 (encoding p with q's codebook)
H_p(q) = 1.4708 (encoding q with p's codebook)
4. Relative Entropy (KL Divergence) — Information Gained¶
In 1951, Solomon Kullback and Richard Leibler introduced what they called discrimination information — now universally known as the Kullback-Leibler divergence. It measures the extra cost of using the wrong distribution, above and beyond the optimum:
$$D_{\text{KL}}(p \| q) = \mathbb{E}_p\!\left[\ln \frac{p(X)}{q(X)}\right] = H_q(p) - H(p)$$
Why it matters: KL divergence is the workhorse of Bayesian inference (it measures the information gained when updating from prior $q$ to posterior $p$), variational methods (maximizing the ELBO is equivalent to minimizing a KL term), and model selection.
Critical property — asymmetry: $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general. This is not a true distance.
Continuous KL divergence¶
# Analytical KL between normals
def normal_kl(mu_1, sigma_1, mu_2, sigma_2):
return ((mu_1 - mu_2)**2 + sigma_1**2 - sigma_2**2) / (2 * sigma_2**2) + np.log(sigma_2 / sigma_1)
KL_pq_est = relative_entropy_from_samples(sample_p, sample_q)
KL_qp_est = relative_entropy_from_samples(sample_q, sample_p)
KL_pq_exact = normal_kl(mu_p, sigma_p, mu_q, sigma_q)
KL_qp_exact = normal_kl(mu_q, sigma_q, mu_p, sigma_p)
print(f'KL(p || q) estimated = {KL_pq_est:.4f}, analytical = {KL_pq_exact:.4f}')
print(f'KL(q || p) estimated = {KL_qp_est:.4f}, analytical = {KL_qp_exact:.4f}')
print(f'\nAsymmetry: KL(p||q) ≠ KL(q||p): {KL_pq_est:.4f} ≠ {KL_qp_est:.4f}')
KL(p || q) estimated = 0.2990, analytical = 0.3445
KL(q || p) estimated = 0.1827, analytical = 0.1832

Asymmetry: KL(p||q) ≠ KL(q||p): 0.2990 ≠ 0.1827
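The same Monte Carlo trick applies here: average the exact log-ratio $\ln(p/q)$ over samples from $p$ (a sketch independent of the KDE-based estimator):
# Monte Carlo estimate of KL(p||q) = E_p[ln(p(X)/q(X))] from the exact densities
KL_pq_mc = np.mean(np.log(pdf_p(sample_p) / pdf_q(sample_p)))
print(f'KL(p||q) via Monte Carlo = {KL_pq_mc:.4f} (analytical: {KL_pq_exact:.4f})')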
Verify the key relationship: $D_{\text{KL}}(p \| q) = H_q(p) - H(p)$
kl_from_ce = CE_pq_est - H_p_est
print(f'KL(p||q) directly = {KL_pq_est:.6f}')
print(f'H_q(p) - H(p) = {kl_from_ce:.6f}')
print(f'Match? {np.isclose(KL_pq_est, kl_from_ce, rtol=1e-6)}')
KL(p||q) directly = 0.299003
H_q(p) - H(p) = 0.299003
Match? True
Discrete KL divergence¶
KL_dp_dq = discrete_relative_entropy(discrete_p, discrete_q)
KL_dq_dp = discrete_relative_entropy(discrete_q, discrete_p)
print(f'KL(p || q) = {KL_dp_dq:.4f}')
print(f'KL(q || p) = {KL_dq_dp:.4f} ← different! KL is asymmetric')
KL(p || q) = 0.3348
KL(q || p) = 0.3819 ← different! KL is asymmetric
5. Jensen-Shannon Divergence — A Symmetric Alternative¶
Named after Johan Jensen (of Jensen's inequality fame) and Claude Shannon, the Jensen-Shannon divergence symmetrizes KL by averaging both directions through a mixture $m = \tfrac{1}{2}(p + q)$:
$$\text{JSD}(p \| q) = \frac{1}{2} D_{\text{KL}}(p \| m) + \frac{1}{2} D_{\text{KL}}(q \| m)$$
Why it matters: JSD is always finite and bounded in $[0, \ln 2]$ nats (equivalently $[0, 1]$ bits), and its square root is a true metric that satisfies the triangle inequality. It appears in the original GAN paper: with an optimal discriminator, the generator implicitly minimizes the JSD between the data and model distributions.
Key properties:
- Symmetric: $\text{JSD}(p \| q) = \text{JSD}(q \| p)$
- Bounded: $0 \leq \text{JSD} \leq \ln 2$ (nats)
Continuous JSD¶
JSD_pq = jensen_shannon_divergence_from_samples(sample_p, sample_q)
JSD_qp = jensen_shannon_divergence_from_samples(sample_q, sample_p)
print(f'JSD(p, q) = {JSD_pq:.6f}')
print(f'JSD(q, p) = {JSD_qp:.6f} ← symmetric!')
print(f'\nBounded: 0 ≤ {JSD_pq:.4f} ≤ ln(2) = {np.log(2):.4f}')
print(f'In bits: JSD = {jensen_shannon_divergence_from_samples(sample_p, sample_q, base=2.0):.6f} (bounded in [0, 1])')
JSD(p, q) = 0.052551
JSD(q, p) = 0.052551 ← symmetric!

Bounded: 0 ≤ 0.0526 ≤ ln(2) = 0.6931
In bits: JSD = 0.075815 (bounded in [0, 1])
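There is no simple closed form for the JSD between two normals, but numerical quadrature over the mixture $m = \tfrac{1}{2}(p + q)$ gives a near-exact reference value to compare against (a sketch using scipy.integrate and the exact densities from earlier):
from scipy import integrate
# JSD = 0.5*KL(p||m) + 0.5*KL(q||m) with the mixture m = (p + q)/2
pdf_m = lambda t: 0.5 * (pdf_p(t) + pdf_q(t))
kl_p_m = integrate.quad(lambda t: pdf_p(t) * np.log(pdf_p(t) / pdf_m(t)), -30, 30)[0]
kl_q_m = integrate.quad(lambda t: pdf_q(t) * np.log(pdf_q(t) / pdf_m(t)), -30, 30)[0]
print(f'JSD(p, q) by quadrature = {0.5 * (kl_p_m + kl_q_m):.6f}')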
Discrete JSD¶
JSD_dp = discrete_jensen_shannon_divergence(discrete_p, discrete_q)
JSD_dq = discrete_jensen_shannon_divergence(discrete_q, discrete_p)
print(f'JSD(p, q) = {JSD_dp:.6f}')
print(f'JSD(q, p) = {JSD_dq:.6f} ← symmetric!')
JSD(p, q) = 0.086305
JSD(q, p) = 0.086305 ← symmetric!
6. Mutual Information — Dependence Beyond Correlation¶
Shannon (1948) defined mutual information alongside entropy. It answers the question: how much does knowing $X$ tell me about $Y$? Formally, it is the KL divergence between the joint distribution and the product of the marginals:
$$I(X; Y) = D_{\text{KL}}\big(p_{X,Y} \| p_X \otimes p_Y\big) = \mathbb{E}_{p_{X,Y}}\!\left[\ln \frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}\right]$$
Why it matters: Unlike Pearson correlation, mutual information captures all statistical dependence — linear and nonlinear. It is the gold standard for measuring association in feature selection, neuroscience, and genomics.
Key properties:
- Symmetric: $I(X; Y) = I(Y; X)$
- Non-negative: $I(X; Y) \geq 0$, with equality iff $X$ and $Y$ are independent
- $I(X; Y) = H(X) + H(Y) - H(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$
We demonstrate with a bivariate normal where we can control the correlation $\rho$ and compare to the analytical result.
# Bivariate normal setup
mu_x, sigma_x = 2, 3
mu_y, sigma_y = 1, 2
rho = 0.5
z = np.random.randn(n)
sample_x = mu_x + sigma_x * z
sample_y = mu_y + sigma_y * (rho * z + np.sqrt(1.0 - rho**2) * np.random.randn(n))
# Analytical MI for bivariate normal: I = -0.5 * ln(1 - ρ²)
MI_exact = -0.5 * np.log(1.0 - rho**2)
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].scatter(sample_x[:2000], sample_y[:2000], alpha=0.2, s=5, color='steelblue')
axes[0].set_xlabel('X'); axes[0].set_ylabel('Y')
axes[0].set_title(f'Bivariate normal (ρ = {rho})')
# Also show independent case
sample_y_indep = mu_y + sigma_y * np.random.randn(n)
axes[1].scatter(sample_x[:2000], sample_y_indep[:2000], alpha=0.2, s=5, color='coral')
axes[1].set_xlabel('X'); axes[1].set_ylabel('Y')
axes[1].set_title('Independent (ρ = 0)')
plt.tight_layout()
plt.savefig(FIGURES_DIR / 'bivariate_normal.png', dpi=300, bbox_inches='tight')
plt.show()
MI_est = mutual_information_from_samples(sample_x, sample_y)
MI_indep = mutual_information_from_samples(sample_x, sample_y_indep)
print(f'I(X; Y) estimated = {MI_est:.4f}, analytical = {MI_exact:.4f} (correlated)')
print(f'I(X; Y) independent = {MI_indep:.4f} (should be ≈ 0)')
# Symmetry
MI_yx = mutual_information_from_samples(sample_y, sample_x)
print(f'\nSymmetry: I(X;Y) = {MI_est:.6f}, I(Y;X) = {MI_yx:.6f}')
I(X; Y) estimated = 0.1500, analytical = 0.1438 (correlated)
I(X; Y) independent = 0.0094 (should be ≈ 0)

Symmetry: I(X;Y) = 0.149988, I(Y;X) = 0.149988
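Because the joint distribution is an exact bivariate Gaussian, we can verify by Monte Carlo as well, averaging the exact log density ratio over the sample (a sketch using scipy.stats):
from scipy import stats
# Monte Carlo estimate of I(X;Y) = E[ln p(x,y) / (p(x) p(y))] from the exact densities
cov_xy = [[sigma_x**2, rho * sigma_x * sigma_y],
          [rho * sigma_x * sigma_y, sigma_y**2]]
joint_pdf = stats.multivariate_normal([mu_x, mu_y], cov_xy)
log_ratio = (joint_pdf.logpdf(np.column_stack([sample_x, sample_y]))
             - stats.norm.logpdf(sample_x, mu_x, sigma_x)
             - stats.norm.logpdf(sample_y, mu_y, sigma_y))
print(f'I(X;Y) via Monte Carlo = {log_ratio.mean():.4f} (analytical: {MI_exact:.4f})')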
Discrete mutual information¶
discrete_x = np.array([1, 1, 3, 1, 2, 3])
discrete_y = np.array([1, 1, 1, 3, 2, 1])
MI_d = discrete_mutual_information(discrete_x, discrete_y)
MI_d_rev = discrete_mutual_information(discrete_y, discrete_x)
print(f'I(X; Y) = {MI_d:.4f}')
print(f'I(Y; X) = {MI_d_rev:.4f} ← symmetric!')
I(X; Y) = 0.5493
I(Y; X) = 0.5493 ← symmetric!
7. Joint Entropy — Total Uncertainty in Two Variables¶
The joint entropy of two random variables is the total uncertainty in the pair $(X, Y)$ considered together:
$$H(X, Y) = -\mathbb{E}_{p_{X,Y}}\!\left[\ln p_{X,Y}(x, y)\right]$$
Why it matters: Joint entropy tells you how many bits you need to describe both variables simultaneously. A key inequality: $H(X, Y) \leq H(X) + H(Y)$, with equality if and only if $X$ and $Y$ are independent. Any dependence reduces total uncertainty below the sum of individual uncertainties.
H_joint = joint_entropy_from_samples(sample_x, sample_y)
H_x = entropy_from_samples(sample_x)
H_y = entropy_from_samples(sample_y)
print(f'H(X, Y) = {H_joint:.4f}')
print(f'H(X) + H(Y) = {H_x + H_y:.4f} (would equal H(X,Y) if independent)')
print(f'Difference = {(H_x + H_y) - H_joint:.4f} (= mutual information!)')
H(X, Y) = 4.4713
H(X) + H(Y) = 4.6499 (would equal H(X,Y) if independent)
Difference = 0.1786 (= mutual information!)
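For a bivariate normal the joint entropy has a closed form, $H(X,Y) = 1 + \ln(2\pi) + \tfrac{1}{2}\ln\det\Sigma$, which provides an exact reference for the estimate above (a quick sketch):
# Analytical joint entropy of a bivariate normal: H = 1 + ln(2π) + 0.5 * ln(det Σ)
det_cov = sigma_x**2 * sigma_y**2 * (1.0 - rho**2)
H_joint_exact = 1.0 + np.log(2 * np.pi) + 0.5 * np.log(det_cov)
print(f'H(X, Y) analytical = {H_joint_exact:.4f}')  # compare with the KDE estimate above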
Discrete joint entropy¶
H_joint_d = discrete_joint_entropy(discrete_x, discrete_y)
print(f'H(X, Y) = {H_joint_d:.4f}')
H(X, Y) = 1.3297
8. Conditional Entropy — What Remains Unknown¶
Conditional entropy measures the residual uncertainty about $Y$ after you have observed $X$:
$$H(Y|X) = -\mathbb{E}_{p_{X,Y}}\!\left[\ln \frac{p_{X,Y}(x,y)}{p_X(x)}\right] = H(X, Y) - H(X)$$
Why it matters: Conditional entropy drives feature selection (a feature is useful if it reduces conditional entropy of the target), decision tree splitting, and communication with side information.
The chain rule of entropy — arguably the most elegant identity in information theory:
$$H(X, Y) = H(X) + H(Y|X) = H(Y) + H(X|Y)$$
H_y_given_x = conditional_entropy_from_samples(sample_x, sample_y)
H_x_given_y = conditional_entropy_from_samples(sample_y, sample_x)
print(f'H(Y|X) = {H_y_given_x:.4f}')
print(f'H(X|Y) = {H_x_given_y:.4f}')
H(Y|X) = 1.9634
H(X|Y) = 2.3579
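For the bivariate normal these also have closed forms: conditioning on one coordinate shrinks the variance by a factor of $(1 - \rho^2)$, so $H(Y\vert X) = \tfrac{1}{2}\big(1 + \ln(2\pi\sigma_y^2(1-\rho^2))\big)$, and symmetrically for $H(X\vert Y)$ (a quick sketch for comparison):
# Analytical conditional entropies for the bivariate normal
H_y_given_x_exact = 0.5 * (1.0 + np.log(2 * np.pi * sigma_y**2 * (1.0 - rho**2)))
H_x_given_y_exact = 0.5 * (1.0 + np.log(2 * np.pi * sigma_x**2 * (1.0 - rho**2)))
print(f'H(Y|X) analytical = {H_y_given_x_exact:.4f}')
print(f'H(X|Y) analytical = {H_x_given_y_exact:.4f}')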
Verify the chain rule: $H(X, Y) = H(X) + H(Y|X)$¶
chain_1 = H_x + H_y_given_x
chain_2 = H_y + H_x_given_y
print(f'H(X,Y) = {H_joint:.4f}')
print(f'H(X) + H(Y|X) = {chain_1:.4f} match? {np.isclose(H_joint, chain_1, rtol=1e-2)}')
print(f'H(Y) + H(X|Y) = {chain_2:.4f} match? {np.isclose(H_joint, chain_2, rtol=1e-2)}')
H(X,Y) = 4.4713
H(X) + H(Y|X) = 4.4855 match? True
H(Y) + H(X|Y) = 4.4856 match? True
Verify the MI connection: $I(X; Y) = H(Y) - H(Y|X)$¶
MI_from_cond = H_y - H_y_given_x
print(f'I(X;Y) directly = {MI_est:.4f}')
print(f'H(Y) - H(Y|X) = {MI_from_cond:.4f}')
print(f'Match? {np.isclose(MI_est, MI_from_cond, rtol=1e-1)}')
I(X;Y) directly = 0.1500
H(Y) - H(Y|X) = 0.1643
Match? True
Discrete conditional entropy¶
H_dy_given_dx = discrete_conditional_entropy_of_y_given_x(discrete_x, discrete_y)
print(f'H(Y|X) = {H_dy_given_dx:.4f}')
# Verify discrete chain rule
H_dx = discrete_entropy(discrete_x)
print(f'\nChain rule: H(X) + H(Y|X) = {H_dx + H_dy_given_dx:.4f}, H(X,Y) = {H_joint_d:.4f}')
print(f'Match? {np.isclose(H_dx + H_dy_given_dx, H_joint_d)}')
H(Y|X) = 0.3183

Chain rule: H(X) + H(Y|X) = 1.3297, H(X,Y) = 1.3297
Match? True
9. The Information-Theoretic Web¶
All the measures we have seen are connected by a small set of elegant identities. Understanding these relationships is more important than memorizing individual definitions.
Fundamental identities¶
| Identity | Meaning |
|---|---|
| $D_{\text{KL}}(p \| q) = H_q(p) - H(p)$ | KL = cross entropy minus entropy |
| $H(X,Y) = H(X) + H(Y\vert X)$ | Chain rule (both directions) |
| $I(X;Y) = H(X) + H(Y) - H(X,Y)$ | MI as redundancy |
| $I(X;Y) = H(X) - H(X\vert Y)$ | MI as uncertainty reduction |
| $I(X;Y) = H(Y) - H(Y\vert X)$ | Symmetric form |
The Venn diagram view¶
Think of $H(X)$ and $H(Y)$ as two overlapping circles. Their union is $H(X,Y)$, their intersection is $I(X;Y)$, and the non-overlapping parts are $H(X|Y)$ and $H(Y|X)$:
$$H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X)$$
Computational verification¶
Let's verify several of these identities with our estimated values:
print('=== Verifying information-theoretic identities ===')
print()
# Identity 1: I(X;Y) = H(X) + H(Y) - H(X,Y)
MI_from_joint = H_x + H_y - H_joint
print(f'I(X;Y) = H(X) + H(Y) - H(X,Y)')
print(f' {MI_est:.4f} ≈ {H_x:.4f} + {H_y:.4f} - {H_joint:.4f} = {MI_from_joint:.4f}')
print()
# Identity 2: H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X)
venn_sum = H_x_given_y + MI_est + H_y_given_x
print(f'H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X)')
print(f' {H_joint:.4f} ≈ {H_x_given_y:.4f} + {MI_est:.4f} + {H_y_given_x:.4f} = {venn_sum:.4f}')
print()
# Identity 3 (discrete): KL = CE - H
kl_check = CE_dp_dq - H_dp
print(f'KL(p||q) = H_q(p) - H(p) [discrete]')
print(f' {KL_dp_dq:.4f} ≈ {CE_dp_dq:.4f} - {H_dp:.4f} = {kl_check:.4f}')
=== Verifying information-theoretic identities ===

I(X;Y) = H(X) + H(Y) - H(X,Y)
 0.1500 ≈ 2.5222 + 2.1277 - 4.4713 = 0.1786

H(X,Y) = H(X|Y) + I(X;Y) + H(Y|X)
 4.4713 ≈ 2.3579 + 0.1500 + 1.9634 = 4.4713

KL(p||q) = H_q(p) - H(p) [discrete]
 0.3348 ≈ 0.9738 - 0.6390 = 0.3348
10. Where to Go from Here¶
This notebook covered the classical core of information theory — the measures that Shannon, Kullback, and Leibler gave us in the 1940s and 50s. But the story doesn't end here. The Divergence package provides a series of companion notebooks that continue the journey; the next two are:
Beyond Kullback-Leibler — A Menagerie of Divergences¶
KL divergence has sharp edges: it's asymmetric, can blow up to infinity, and demands absolute continuity. Over seven decades, mathematicians built alternatives — Csiszár's f-divergence family, Hellinger's gentle metric, Pearson's chi-squared, and Rényi's parameterized telescope. Each tames a different wildness.
Moving Earth, Counting Neighbors — Distribution Distances and Statistical Testing¶
What if you don't have densities — only samples? Kantorovich's optimal transport, Gretton's kernel methods, and Kozachenko & Leonenko's neighbor-distance estimators work directly with point clouds. Plus: how to turn any distance into a rigorous two-sample hypothesis test.
Recommended reading¶
- Cover & Thomas, Elements of Information Theory (2nd ed.) — the definitive textbook
- MacKay, Information Theory, Inference, and Learning Algorithms — freely available online, beautifully written
- Shannon (1948), A Mathematical Theory of Communication — the original paper, still remarkably readable
The Divergence Notebook Series¶
| # | Notebook | What it covers |
|---|---|---|
| 1 | Divergence (this notebook) | Shannon's foundations: entropy, cross entropy, KL divergence, Jensen-Shannon, mutual information, joint and conditional entropy |
| 2 | Beyond KL | f-divergences (TV, Hellinger, chi-squared, Jeffreys, Cressie-Read) and the Rényi family |
| 3 | Distances and Testing | Sample-based methods: Wasserstein, Sinkhorn, energy distance, MMD, kNN estimators, two-sample permutation tests |
| 4 | Dependence and Causality | Multivariate dependence (TC, NMI, VI) and directed information flow (transfer entropy) |
| 5 | Bayesian Diagnostics | End-to-end MCMC with emcee on the Nile change-point — convergence diagnostics, information gain, Bayesian surprise |
| 6 | Real-World Applications | Stock market contagion, crop yields, Phillips Curve — real data, real stakes |
| 7 | Score-Based Divergences: Fisher and Stein | Fisher divergence and kernel Stein discrepancy |
| 8 | Did My Sampler Find the Truth? | KSD as convergence diagnostic with NumPyro: NUTS vs VI, the 250-year journey from Bayes to Stein |
| 9 | Phillips Curve TVP | Time-varying Phillips Curve with PyJAGS Gibbs sampling — stagflation as a structural break |