Bayesian Diagnostics

ArviZ integration for MCMC convergence assessment and Bayesian inference diagnostics. All functions accept ArviZ InferenceData or xarray.DataTree objects.

Inference Diagnostics

information_gain(idata, *, var_names=None, method='kl', multivariate=False)

Compute how much the data updated beliefs about each parameter.

Measures D(posterior || prior) per parameter using the specified divergence. This is the information-theoretic analogue of asking "how informative was the data?"

Parameters:

Name Type Description Default
idata DataTree

ArviZ InferenceData (xarray.DataTree). Must contain "posterior" and "prior" groups.

required
var_names list of str

Parameters to analyze. If None, all parameters shared between posterior and prior are used.

None
method str

Divergence measure: "kl" (default, kNN-based), "js", "hellinger", "tv", "wasserstein", "mmd", "energy".

'kl'
multivariate bool

If True, compute multivariate divergence for vector parameters (requires "kl", "mmd", or "energy"). If False (default), compute per component.

False

Returns:

Type Description
dict[str, float | ndarray]

Maps parameter names to divergence values. Scalar parameters produce float; vector parameters produce np.ndarray of per-component values (or float when multivariate is True).

Raises:

Type Description
ImportError

If ArviZ is not installed.

ValueError

If required groups are missing or method is invalid.

KeyError

If a requested variable is not found in both groups.

Notes

The KL divergence :math:D_{\text{KL}}(\text{posterior} \| \text{prior}) quantifies the information gained from the data in nats. A value near 0 means the data was uninformative for that parameter; a large value means the posterior differs substantially from the prior.

The kNN-based KL estimator is the default because it is fast, requires no density estimation, and works in any dimension.
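For intuition, here is a minimal 1D version of that kNN estimator (the textbook construction of Wang, Kulkarni & Verdú; a sketch only, and knn_kl_1d is a hypothetical name, not part of this API):

>>> import numpy as np
>>> def knn_kl_1d(p, q, k=1):
...     """Sketch: kNN estimate of D(P || Q) from 1D samples p ~ P, q ~ Q."""
...     n, m = len(p), len(q)
...     # k-th nearest-neighbour distance within p (column 0 is the self-distance)
...     rho = np.sort(np.abs(p[:, None] - p[None, :]), axis=1)[:, k]
...     # k-th nearest-neighbour distance from each p point into q
...     nu = np.sort(np.abs(p[:, None] - q[None, :]), axis=1)[:, k - 1]
...     return float(np.mean(np.log(nu / rho)) + np.log(m / (n - 1)))
>>> rng = np.random.default_rng(42)
>>> post, prior = rng.normal(1, 0.5, 2000), rng.normal(0, 1, 2000)
>>> knn_kl_1d(post, prior) > 0  # analytic Gaussian KL here is ~0.82
True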

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior": {"mu": rng.normal(1, 0.5, (2, 500))},
...     "prior": {"mu": rng.normal(0, 1, (2, 500))},
... })
>>> result = information_gain(idata)
>>> result["mu"] > 0
True
References

.. [1] Kullback, S. & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Math. Stat., 22(1), 79-86.
.. [2] Lindley, D. V. (1956). "On a measure of the information provided by an experiment." Annals of Math. Stat., 27(4), 986-1005.

prior_sensitivity(idata, reference_prior_samples, *, var_names=None, method='kl')

Quantify how sensitive posteriors are to the choice of prior.

Compares information gain under the actual prior versus a reference prior. High sensitivity means conclusions depend on the prior choice.

Parameters:

Name Type Description Default
idata DataTree

ArviZ InferenceData with "posterior" and "prior" groups.

required
reference_prior_samples dict[str, ndarray]

Maps parameter names to 1D numpy arrays of samples from a reference (e.g. vague/uninformative) prior.

required
var_names list of str

Parameters to analyze. If None, all parameters present in posterior, prior, and reference_prior_samples are used.

None
method str

Divergence measure. Default is "kl".

'kl'

Returns:

Type Description
dict[str, dict[str, float]]

Maps parameter names to dicts with:

  • "actual": D(posterior || actual_prior)
  • "reference": D(posterior || reference_prior)
  • "sensitivity": abs(actual - reference)
Notes

If sensitivity is high for a parameter, this suggests the data is not very informative for that parameter and conclusions depend on the prior. If sensitivity is low, the data dominates and the prior choice matters little.
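As a usage sketch, the returned dict can be screened for prior-dominated parameters; the 0.5 nat cutoff and the helper name below are illustrative, not part of this API:

>>> def flag_prior_sensitive(result, threshold=0.5):
...     """Names of parameters whose conclusions may hinge on the prior."""
...     return [p for p, d in result.items() if d["sensitivity"] > threshold]
>>> flag_prior_sensitive({"mu": {"actual": 0.8, "reference": 0.9, "sensitivity": 0.1}})
[]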

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior": {"mu": rng.normal(1, 0.5, (2, 500))},
...     "prior": {"mu": rng.normal(0, 1, (2, 500))},
... })
>>> ref = {"mu": rng.normal(0, 10, 5000)}
>>> result = prior_sensitivity(idata, ref)
>>> "sensitivity" in result["mu"]
True

bayesian_surprise(idata, *, log_likelihood_group='log_likelihood', var_name=None)

Compute per-observation surprise (self-information).

For each observation i, the surprise is:

.. math:: S(y_i) = -\log \mathbb{E}_\theta[p(y_i \mid \theta)]

approximated via the log-sum-exp trick over MCMC draws.
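In array terms, with a log-likelihood array of shape (chain, draw, obs), the formula reduces to a few lines; the following is a sketch of the computation, not the library's internals:

>>> import numpy as np
>>> from scipy.special import logsumexp
>>> def surprise_sketch(loglik):
...     flat = loglik.reshape(-1, loglik.shape[-1])  # pool chains and draws
...     # S(y_i) = log(n_samples) - logsumexp over draws of loglik[:, i]
...     return np.log(flat.shape[0]) - logsumexp(flat, axis=0)
>>> rng = np.random.default_rng(42)
>>> surprise_sketch(rng.standard_normal((2, 500, 20))).shape
(20,)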

Parameters:

Name Type Description Default
idata DataTree

ArviZ InferenceData. Must contain the log-likelihood group.

required
log_likelihood_group str

Name of the log-likelihood group. Default is "log_likelihood".

'log_likelihood'
var_name str

Variable name within the group. If None and the group contains exactly one variable, that variable is used.

None

Returns:

Type Description
dict[str, ndarray]

Maps variable names to arrays of per-observation surprise values. Higher values indicate more surprising (influential or outlying) observations.

Raises:

Type Description
ValueError

If the log-likelihood group is missing or var_name cannot be resolved.

Notes

Observations with high surprise are:

  • Influential data points driving the posterior
  • Potential outliers the model struggles to explain
  • Regions of model misspecification

This is closely related to the pointwise WAIC/LOO computation but expressed in the information-theoretic framework.

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "log_likelihood": {"y": rng.standard_normal((2, 500, 20))}
... })
>>> result = bayesian_surprise(idata)
>>> result["y"].shape
(20,)
References

.. [1] Itti, L. & Baldi, P. (2009). "Bayesian surprise attracts human attention." Vision Research, 49(10), 1295-1306.

uncertainty_decomposition(idata, *, var_names=None, method='knn', group='posterior_predictive')

Decompose predictive uncertainty into aleatoric and epistemic.

Uses the information-theoretic decomposition:

.. math:: \underbrace{H[p(y^* \mid y)]}_{\text{total}} = \underbrace{\mathbb{E}_\theta[H[p(y^* \mid \theta)]]}_{\text{aleatoric}} + \underbrace{I(y^*; \theta \mid y)}_{\text{epistemic}}

Parameters:

Name Type Description Default
idata DataTree

ArviZ InferenceData with the posterior predictive group.

required
var_names list of str

Predicted variable names. If None, all variables in the group are used.

None
method str

Entropy estimation method. "knn" (default) uses the Kozachenko-Leonenko kNN estimator. "kde" uses kernel density estimation.

'knn'
group str

Predictive group name. Default is "posterior_predictive".

'posterior_predictive'

Returns:

Type Description
dict[str, dict[str, float]]

Maps variable names to dicts with:

  • "total": H[p(y* | y)] — marginal predictive entropy
  • "aleatoric": E_theta[H[p(y* | theta)]] — average conditional entropy
  • "epistemic": total - aleatoric = I(y*; theta | y)

Raises:

Type Description
ValueError

If the predictive group is missing or the observation dimension is too small (< 10) for reliable entropy estimation.

Notes

Total entropy is estimated from all posterior predictive samples pooled across chains and draws.

Aleatoric entropy is estimated per MCMC draw: each (chain, draw) pair corresponds to a fixed parameter value theta, and the observation dimension provides the samples from p(y* | theta). The aleatoric estimate is the mean of these per-draw entropies.

This requires the posterior predictive to have an observation dimension (e.g., shape (chain, draw, obs)). If there is only one predicted value per draw, per-draw entropy cannot be estimated.
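The recipe is short enough to sketch, with the closed-form Gaussian entropy 0.5*log(2*pi*e*var) standing in for the kNN/KDE estimators the function actually uses:

>>> import numpy as np
>>> def gaussian_entropy(x):
...     return float(0.5 * np.log(2 * np.pi * np.e * np.var(x)))
>>> def decompose_sketch(pp):  # pp: (chain, draw, obs)
...     flat = pp.reshape(-1, pp.shape[-1])
...     total = gaussian_entropy(flat.ravel())                           # pooled
...     aleatoric = float(np.mean([gaussian_entropy(r) for r in flat]))  # per draw
...     return {"total": total, "aleatoric": aleatoric,
...             "epistemic": total - aleatoric}
>>> rng = np.random.default_rng(42)
>>> decompose_sketch(rng.normal(0, 1, (2, 500, 50)))["epistemic"] >= -0.1
True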

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior_predictive": {"y": rng.normal(0, 1, (2, 500, 50))}
... })
>>> result = uncertainty_decomposition(idata)
>>> result["y"]["epistemic"] >= -0.1
True
References

.. [1] Depeweg, S. et al. (2018). "Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning." ICML.

model_divergence(idata1, idata2, *, var_names=None, method='energy', group='posterior_predictive')

Compare predictive distributions from two models.

Computes the divergence between corresponding variables in the specified group of two InferenceData objects.

Parameters:

Name Type Description Default
idata1 DataTree

First of the two ArviZ InferenceData objects to compare. Must contain group.

required
idata2 DataTree

Second of the two ArviZ InferenceData objects to compare. Must contain group.

required
var_names list of str

Variables to compare. If None, all shared variables are used.

None
method str

Divergence measure. Default is "energy" (symmetric, no density estimation). Other options: "mmd", "js", "kl".

'energy'
group str

Group to compare. Default is "posterior_predictive".

'posterior_predictive'

Returns:

Type Description
dict[str, float]

Maps variable names to divergence values.

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata1 = az.from_dict({
...     "posterior_predictive": {"y": rng.normal(0, 1, (2, 500))}
... })
>>> idata2 = az.from_dict({
...     "posterior_predictive": {"y": rng.normal(1, 1, (2, 500))}
... })
>>> result = model_divergence(idata1, idata2)
>>> result["y"] > 0
True

Convergence Diagnostics

chain_divergence(idata, *, var_names=None, method='energy', group='posterior')

Compute pairwise distributional divergence between MCMC chains.

Complements R-hat with a full distributional comparison: R-hat checks whether chains have similar means and variances, while this function checks whether they sample from the same distribution (detecting differences in shape, skewness, and multimodality).

Parameters:

Name Type Description Default
idata DataTree

ArviZ InferenceData with multiple chains in the specified group.

required
var_names list of str

Parameters to analyze. If None, all parameters are used.

None
method str

Divergence measure. Default is "energy" (symmetric, works in any dimension, no density estimation). Other options: "mmd", "js", "kl", "hellinger", "tv", "wasserstein".

'energy'
group str

Group to analyze. Default is "posterior".

'posterior'

Returns:

Type Description
dict[str, ndarray]

Maps parameter names to pairwise divergence matrices of shape (n_chains, n_chains). Diagonal entries are zero (or near-zero); entry (i, j) is the divergence between chains i and j.

Notes

For symmetric methods (energy, MMD, JS, Hellinger, TV, Wasserstein), the matrix is symmetric. For KL divergence, entry (i, j) is :math:D_{\text{KL}}(\text{chain}_i \| \text{chain}_j).

Well-mixed chains should have small pairwise divergences. Large off-diagonal values indicate chains exploring different regions of parameter space.

For method="mmd", the RBF bandwidth is estimated once from the pooled samples across all chains, not per pair. This gives every entry of the returned matrix a common scale (appropriate when reading the matrix as a relative ranking of chain pairs) and roughly halves the wall time by removing the per-pair median-heuristic call. Users who need per-pair bandwidths can call :func:~divergence.ipms.maximum_mean_discrepancy directly.
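For reference, the default method's matrix can be reproduced (up to the library's normalisation) with the textbook energy distance; a 1D sketch:

>>> import numpy as np
>>> def energy_distance_1d(x, y):
...     """2 E|X-Y| - E|X-X'| - E|Y-Y'|, estimated from samples."""
...     dxy = np.abs(x[:, None] - y[None, :]).mean()
...     dxx = np.abs(x[:, None] - x[None, :]).mean()
...     dyy = np.abs(y[:, None] - y[None, :]).mean()
...     return 2 * dxy - dxx - dyy
>>> def pairwise_energy(draws):  # draws: (n_chains, n_draws)
...     c = len(draws)
...     out = np.zeros((c, c))
...     for i in range(c):
...         for j in range(i + 1, c):
...             out[i, j] = out[j, i] = energy_distance_1d(draws[i], draws[j])
...     return out
>>> rng = np.random.default_rng(42)
>>> pairwise_energy(rng.normal(0, 1, (4, 500))).shape
(4, 4)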

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_divergence(idata)
>>> result["mu"].shape
(4, 4)

chain_ksd(idata, score_fn, *, var_names=None, group='posterior', kernel='imq', bandwidth=None, split=True)

Compute per-chain kernel Stein discrepancy against the target.

Unlike :func:chain_divergence, which tests whether chains agree with each other, this function tests whether each chain has converged to the correct target distribution. It is the only diagnostic that can detect the situation where all chains have converged to the wrong distribution.

The KSD requires only the score function :math:\nabla \log \pi(x) of the target, which is available as a byproduct of gradient-based MCMC (HMC, NUTS).
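For example, the score function of a univariate Gaussian target N(mu, sigma^2) is -(x - mu) / sigma**2, vectorised to the (n, d) convention described below:

>>> import numpy as np
>>> mu, sigma = 1.0, 2.0
>>> score_fn = lambda x: -(np.asarray(x) - mu) / sigma**2  # (n, d) -> (n, d)
>>> score_fn(np.array([[3.0], [5.0]])).ravel().tolist()
[-0.5, -1.0]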

Parameters:

Name Type Description Default
idata DataTree or InferenceData

ArviZ inference data containing the specified group.

required
score_fn callable

Score function of the target distribution. Takes an array of shape (n, d) and returns an array of shape (n, d) with :math:\nabla_x \log \pi(x) at each point. For scalar parameters (d = 1), the input and output shapes may be (n, 1).

required
var_names list of str

Parameters to analyze. If None, all parameters in the group are used.

None
group str

InferenceData group. Default is "posterior".

'posterior'
kernel str

Kernel for the Stein operator: "imq" (inverse multiquadric, default) or "rbf" (Gaussian). The IMQ kernel provides provable convergence control guarantees [1]_.

'imq'
bandwidth float or None

Kernel bandwidth / scale parameter. If None, the median heuristic is used, computed once from the pooled samples across all chains. Sharing the bandwidth gives every chain's KSD a common scale (appropriate when reading per-chain KSDs as a relative ranking) and removes the dominant per-call cost. Pass an explicit value to override.

None
split bool

If True (default), additionally compute KSD for the first and second halves of each chain. This is analogous to split-:math:\hat{R} and can detect non-stationarity: if the first-half KSD is much larger than the second-half KSD, the chain was still converging during the first half.

True

Returns:

Type Description
dict[str, ChainKSDResult]

Maps parameter names to :class:~divergence.ChainKSDResult named tuples with fields ksd_per_chain, ksd_split_first, ksd_split_second, and ksd_pooled.

Raises:

Type Description
ImportError

If ArviZ is not installed.

ValueError

If the specified group is missing from idata.

Notes

For the IMQ kernel :math:k(x,y) = (c^2 + \|x-y\|^2)^{-1/2}, Gorham and Mackey (2017) [1]_ proved that :math:\mathrm{KSD}(\mu_n, \pi) \to 0 implies :math:\mu_n \Rightarrow \pi (weak convergence) and tightness of :math:\{\mu_n\}. This is strictly stronger than what :math:\hat{R} can guarantee.
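The V-statistic KSD with this kernel is compact enough to sketch; the following is a bare-bones illustration for x of shape (n, d), not the library's implementation:

>>> import numpy as np
>>> def ksd_imq_sketch(x, score_fn, c=1.0):
...     n, d = x.shape
...     s = score_fn(x)                                  # (n, d) scores
...     diff = x[:, None, :] - x[None, :, :]             # pairwise x_i - x_j
...     r2 = (diff ** 2).sum(-1)
...     base = c ** 2 + r2
...     k = base ** -0.5                                 # IMQ kernel matrix
...     coef = base ** -1.5                              # gradient prefactor
...     cross = (diff * (s[:, None, :] - s[None, :, :])).sum(-1)
...     trace = d * coef - 3.0 * r2 * base ** -2.5       # tr(grad_x grad_y k)
...     u = (s @ s.T) * k + coef * cross + trace         # Stein kernel matrix
...     return float(np.sqrt(u.mean()))                  # V-statistic KSD
>>> rng = np.random.default_rng(42)
>>> ksd_imq_sketch(rng.standard_normal((500, 1)), lambda x: -x) < 0.5
True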

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_ksd(idata, lambda x: -x)
>>> result["mu"].ksd_pooled < 0.1
True
References

.. [1] Gorham, J. & Mackey, L. (2017). "Measuring sample quality with kernels." ICML.
.. [2] Liu, Q., Lee, J., & Jordan, M. (2016). "A kernelized Stein discrepancy for goodness-of-fit tests." ICML.

chain_two_sample_test(idata, *, var_names=None, method='mmd', group='posterior', n_permutations=500, seed=None, low_memory=None)

Pairwise two-sample permutation tests between MCMC chains.

Upgrades :func:chain_divergence from raw divergence magnitudes to calibrated p-values. A small p-value for chains i and j indicates that they are sampling from detectably different distributions, suggesting a convergence failure.

Parameters:

Name Type Description Default
idata DataTree or InferenceData

ArviZ inference data containing the specified group.

required
var_names list of str

Parameters to analyze. If None, all parameters are used.

None
method str

Test statistic: "mmd" (default), "energy", or "kl_knn". These correspond to the methods supported by :func:~divergence.two_sample_test.

'mmd'
group str

InferenceData group. Default is "posterior".

'posterior'
n_permutations int

Number of permutations for the null distribution. Default is 500. Higher values give more precise p-values at increased computational cost.

500
seed int or None

Base random seed for reproducibility. Per-pair seeds are derived as seed + i * n_chains + j to ensure statistical independence across chain pairs while remaining reproducible.

None
low_memory bool or None

Memory strategy passed to :func:~divergence.two_sample_test. None (default) auto-detects based on sample size. Set to True for long chains (N > ~10K per chain) to avoid materializing the O(N²) distance matrix.

None

Returns:

Type Description
dict[str, ChainTestResult]

Maps parameter names to :class:~divergence.ChainTestResult named tuples with fields p_value_matrix, statistic_matrix, min_p_value, and any_significant.

Raises:

Type Description
ImportError

If ArviZ is not installed.

ValueError

If the specified group is missing or method is invalid.

Notes

For well-mixed chains, all pairwise p-values should be large (> 0.05). If any pair has a small p-value, the chains are sampling from detectably different distributions, indicating a convergence problem.

For typical MCMC use (scalar or low-dimensional parameters), the "energy" method is generally faster: it has a 1D sort-based fast path that runs in microseconds even for thousands of draws, and it does not require kernel-bandwidth selection. The "mmd" method is appropriate when you have a specific kernel choice in mind, are working with high-dimensional posteriors, or want to align with a published benchmark that specifies MMD.

For method="mmd", the RBF bandwidth is estimated once from the pooled samples across all chains and reused for every pair, which avoids C*(C-1)/2 redundant median-heuristic calls.
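The pooling is simply the median heuristic applied once to the concatenated draws, sketched below (quadratic in the pooled size, so subsample very long chains):

>>> import numpy as np
>>> def pooled_bandwidth(draws):  # draws: (n_chains, n_draws)
...     pooled = draws.reshape(-1)
...     dist = np.abs(pooled[:, None] - pooled[None, :])
...     return float(np.median(dist[np.triu_indices_from(dist, k=1)]))
>>> rng = np.random.default_rng(42)
>>> pooled_bandwidth(rng.normal(0, 1, (4, 200))) > 0
True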

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_two_sample_test(idata, n_permutations=200, seed=42)
>>> result["mu"].any_significant
False
References

.. [1] Gretton, A. et al. (2012). "A kernel two-sample test." JMLR, 13, 723-773.
.. [2] Szekely, G. J. & Rizzo, M. L. (2004). "Testing for equal distributions in high dimension." InterStat.

mixing_diagnostic(idata, *, var_names=None, group='posterior', lag=1, knn_k=5)

Diagnose chain mixing using transfer entropy.

Applies :func:~divergence.transfer_entropy to detect two types of mixing failure:

  1. Non-stationarity within chains: If the first half of a chain predicts the second half (transfer entropy > 0), the chain has not yet reached its stationary distribution.

  2. Spurious dependence between chains: If one chain's trace predicts another's (transfer entropy > 0), the chains are not truly independent — indicating shared non-stationarity or coupling artifacts. (Both checks are sketched after this list.)
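A sketch of the two checks, with te(source, target) standing in for :func:~divergence.transfer_entropy (the exact call signature and the pairing of chain 0 against the others are illustrative assumptions):

>>> import numpy as np
>>> def mixing_checks(draws, te):  # draws: (n_chains, n_draws)
...     half = draws.shape[1] // 2
...     # 1. Within-chain: does the first half predict the second half?
...     stationarity = np.array([te(c[:half], c[half:]) for c in draws])
...     # 2. Cross-chain: does chain 0 predict each of the other chains?
...     cross = np.array([te(draws[0], d) for d in draws[1:]])
...     return stationarity, cross
>>> rng = np.random.default_rng(42)
>>> s, c = mixing_checks(rng.normal(0, 1, (4, 100)), lambda a, b: 0.0)
>>> (s.shape, c.shape)
((4,), (3,))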

Parameters:

Name Type Description Default
idata DataTree or InferenceData

ArviZ inference data containing the specified group.

required
var_names list of str

Parameters to analyze. If None, all parameters are used.

None
group str

InferenceData group. Default is "posterior".

'posterior'
lag int

Embedding dimension for the target series in the transfer entropy computation. Default is 1.

1
knn_k int

Number of nearest neighbors for the kNN entropy estimator used internally by transfer entropy. Default is 5.

5

Returns:

Type Description
dict[str, MixingDiagnostic]

Maps parameter names to :class:~divergence.MixingDiagnostic named tuples with fields stationarity_te (shape (n_chains,)) and cross_chain_te (shape (n_chains - 1,)).

Raises:

Type Description
ImportError

If ArviZ is not installed.

ValueError

If the group is missing or chains are too short for the requested embedding dimensions.

Notes

For vector parameters (shape (chain, draw, K)), transfer entropy is computed for each of the K components separately and the results are averaged. This avoids the curse of dimensionality in kNN entropy estimation.

Transfer entropy is related to Granger causality: for jointly Gaussian processes the two are equivalent [2]_. The key advantage is that TE is fully nonparametric — it detects any form of directed statistical dependence.

Examples:

>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
...     "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = mixing_diagnostic(idata)
>>> result["mu"].stationarity_te.shape
(4,)
References

.. [1] Schreiber, T. (2000). "Measuring information transfer." Physical Review Letters, 85(2), 461-464.
.. [2] Barnett, L., Barrett, A. B., & Seth, A. K. (2009). "Granger causality and transfer entropy are equivalent for Gaussian variables." Physical Review Letters, 103(23), 238701.