Bayesian Diagnostics¶
ArviZ integration for MCMC convergence assessment and Bayesian inference diagnostics. All functions accept ArviZ InferenceData or xarray.DataTree objects.
Inference Diagnostics¶
information_gain(idata, *, var_names=None, method='kl', multivariate=False)
¶
Compute how much the data updated beliefs about each parameter.
Measures D(posterior || prior) per parameter using the specified divergence. This is the information-theoretic analogue of asking "how informative was the data?"
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData (or DataTree) with `posterior` and `prior` groups. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all variables present in both groups are analyzed. | `None` |
| `method` | `str` | Divergence measure. Default is `'kl'`. | `'kl'` |
| `multivariate` | `bool` | If `True`, compute a single joint divergence across all dimensions of a parameter instead of one value per dimension. | `False` |
Returns:

| Type | Description |
|---|---|
| `dict[str, float \| ndarray]` | Maps parameter names to divergence values. Scalar parameters produce floats; vector-valued parameters produce arrays. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If required groups are missing or `method` is invalid. |
| `KeyError` | If a requested variable is not found in both groups. |
Notes
The KL divergence :math:`D_{\text{KL}}(\text{posterior} \| \text{prior})`
quantifies the information gained from the data, in nats. A value
near 0 means the data was uninformative for that parameter; a large
value means the posterior differs substantially from the prior.
The kNN-based KL estimator is the default because it is fast, requires no density estimation, and works in any dimension.
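The estimator is easy to sketch. Below is a minimal kNN KL estimator in the Wang-Kulkarni-Verdú style, assuming samples arranged as `(n, d)` arrays; the helper name `knn_kl` is illustrative, and the library's implementation may differ in details such as bias correction and tie handling.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl(x, y, k=5):
    """kNN estimate of D_KL(P || Q) from x ~ P (n, d) and y ~ Q (m, d)."""
    n, d = x.shape
    m = y.shape[0]
    # Distance from each x_i to its k-th neighbour within x
    # (query k+1 points: the nearest is x_i itself) ...
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # ... and to its k-th neighbour within y.
    nu = cKDTree(y).query(x, k=k)[0][:, -1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(0)
posterior = rng.normal(1.0, 0.5, (2000, 1))
prior = rng.normal(0.0, 1.0, (2000, 1))
knn_kl(posterior, prior)  # close to the analytic KL of ~0.82 nats
```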
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(1, 0.5, (2, 500))},
... "prior": {"mu": rng.normal(0, 1, (2, 500))},
... })
>>> result = information_gain(idata)
>>> result["mu"] > 0
True
References
.. [1] Kullback, S. & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics, 22(1), 79-86.
.. [2] Lindley, D. V. (1956). "On a measure of the information provided by an experiment." Annals of Mathematical Statistics, 27(4), 986-1005.
prior_sensitivity(idata, reference_prior_samples, *, var_names=None, method='kl')
¶
Quantify how sensitive posteriors are to the choice of prior.
Compares information gain under the actual prior versus a reference prior. High sensitivity means conclusions depend on the prior choice.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData with `posterior` and `prior` groups. | required |
| `reference_prior_samples` | `dict[str, ndarray]` | Maps parameter names to 1D numpy arrays of samples from a reference (e.g. vague/uninformative) prior. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all variables present in both groups are analyzed. | `None` |
| `method` | `str` | Divergence measure. Default is `'kl'`. | `'kl'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, float]]` | Maps parameter names to dicts of summary values, including a `"sensitivity"` key. |
Notes
If sensitivity is high for a parameter, this suggests the data is not very informative for that parameter and conclusions depend on the prior. If sensitivity is low, the data dominates and the prior choice matters little.
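A minimal sketch of the comparison this function formalizes: measure information gain under the actual prior, then again with the reference prior swapped in, and summarize the gap. The absolute-difference summary below is illustrative only; `prior_sensitivity` may report a different summary.

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
posterior = rng.normal(1, 0.5, (2, 500))

# Same posterior paired with the actual prior vs. a vague reference prior.
idata_actual = az.from_dict({
    "posterior": {"mu": posterior},
    "prior": {"mu": rng.normal(0, 1, (2, 500))},
})
idata_reference = az.from_dict({
    "posterior": {"mu": posterior},
    "prior": {"mu": rng.normal(0, 10, (2, 500))},  # reference prior
})

ig_actual = information_gain(idata_actual)["mu"]
ig_reference = information_gain(idata_reference)["mu"]

# Illustrative summary: how much the measured information gain moves
# when the actual prior is replaced by the reference prior.
sensitivity = abs(ig_actual - ig_reference)
```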
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(1, 0.5, (2, 500))},
... "prior": {"mu": rng.normal(0, 1, (2, 500))},
... })
>>> ref = {"mu": rng.normal(0, 10, 5000)}
>>> result = prior_sensitivity(idata, ref)
>>> "sensitivity" in result["mu"]
True
bayesian_surprise(idata, *, log_likelihood_group='log_likelihood', var_name=None)
¶
Compute per-observation surprise (self-information).
For each observation i, the surprise is:
.. math:: S(y_i) = -\log \mathbb{E}_\theta[p(y_i \mid \theta)]
approximated via the log-sum-exp trick over MCMC draws.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData. Must contain the log-likelihood group. | required |
| `log_likelihood_group` | `str` | Name of the log-likelihood group. Default is `'log_likelihood'`. | `'log_likelihood'` |
| `var_name` | `str` | Variable name within the group. If `None`, all variables in the group are used. | `None` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ndarray]` | Maps variable names to arrays of per-observation surprise values. Higher values indicate more surprising (influential or outlying) observations. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the log-likelihood group is missing or `var_name` cannot be resolved. |
Notes
Observations with high surprise are:
- Influential data points driving the posterior
- Potential outliers the model struggles to explain
- Regions of model misspecification
This is closely related to the pointwise WAIC/LOO computation but expressed in the information-theoretic framework.
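The log-sum-exp computation from the summary above can be sketched directly in numpy; `bayesian_surprise` performs the equivalent reduction over the `(chain, draw)` dimensions.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(42)
log_lik = rng.standard_normal((2, 500, 20))   # (chain, draw, obs)

# Pool chains and draws: S values of log p(y_i | theta_s) per observation.
ll = log_lik.reshape(-1, log_lik.shape[-1])
S = ll.shape[0]

# S(y_i) = -log( (1/S) * sum_s p(y_i | theta_s) ), computed stably.
surprise = np.log(S) - logsumexp(ll, axis=0)
surprise.shape  # (20,)
```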
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "log_likelihood": {"y": rng.standard_normal((2, 500, 20))}
... })
>>> result = bayesian_surprise(idata)
>>> result["y"].shape
(20,)
References
.. [1] Itti, L. & Baldi, P. (2009). "Bayesian surprise attracts human attention." Vision Research, 49(10), 1295-1306.
uncertainty_decomposition(idata, *, var_names=None, method='knn', group='posterior_predictive')
¶
Decompose predictive uncertainty into aleatoric and epistemic.
Uses the information-theoretic decomposition:
.. math:: \underbrace{H[p(y^* \mid y)]}_{\text{total}} = \underbrace{\mathbb{E}_\theta[H[p(y^* \mid \theta)]]}_{\text{aleatoric}} + \underbrace{I(y^*; \theta \mid y)}_{\text{epistemic}}
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData with the posterior predictive group. | required |
| `var_names` | `list of str` | Predicted variable names. If `None`, all variables in the group are analyzed. | `None` |
| `method` | `str` | Entropy estimation method. Default is `'knn'`. | `'knn'` |
| `group` | `str` | Predictive group name. Default is `'posterior_predictive'`. | `'posterior_predictive'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, float]]` | Maps variable names to dicts with `"total"`, `"aleatoric"`, and `"epistemic"` entries. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the predictive group is missing or the observation dimension is too small (< 10) for reliable entropy estimation. |
Notes
Total entropy is estimated from all posterior predictive samples pooled across chains and draws.
Aleatoric entropy is estimated per MCMC draw: each
(chain, draw) pair corresponds to a fixed parameter value
theta, and the observation dimension provides the samples from
p(y* | theta). The aleatoric estimate is the mean of these
per-draw entropies.
This requires the posterior predictive to have an observation
dimension (e.g., shape (chain, draw, obs)). If there is only
one predicted value per draw, per-draw entropy cannot be estimated.
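For intuition, the decomposition can be verified on a toy model where the per-draw predictive is Gaussian with known noise, so the per-draw entropy has a closed form. This sketch uses Gaussian entropies throughout for illustration; `uncertainty_decomposition` instead uses the nonparametric estimator selected by `method`.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                  # aleatoric (observation) noise
theta = rng.normal(0.0, 0.5, 1000)           # posterior draws of the mean

# Posterior predictive samples, shape (draw, obs).
y_pred = rng.normal(theta[:, None], sigma, (1000, 50))

def gauss_entropy(s):
    return 0.5 * np.log(2 * np.pi * np.e * s**2)

aleatoric = gauss_entropy(sigma)             # E_theta[ H[p(y* | theta)] ]
total = gauss_entropy(y_pred.std())          # entropy of the pooled predictive
epistemic = total - aleatoric                # I(y*; theta | y), >= 0
```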
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior_predictive": {"y": rng.normal(0, 1, (2, 500, 50))}
... })
>>> result = uncertainty_decomposition(idata)
>>> result["y"]["epistemic"] >= -0.1
True
References
.. [1] Depeweg, S. et al. (2018). "Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning." ICML.
model_divergence(idata1, idata2, *, var_names=None, method='energy', group='posterior_predictive')
¶
Compare predictive distributions from two models.
Computes the divergence between corresponding variables in the specified group of two InferenceData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata1` | `DataTree` | First ArviZ InferenceData object. Must contain `group`. | required |
| `idata2` | `DataTree` | Second ArviZ InferenceData object. Must contain `group`. | required |
| `var_names` | `list of str` | Variables to compare. If `None`, all variables shared by both objects are compared. | `None` |
| `method` | `str` | Divergence measure. Default is `'energy'`. | `'energy'` |
| `group` | `str` | Group to compare. Default is `'posterior_predictive'`. | `'posterior_predictive'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Maps variable names to divergence values. |
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata1 = az.from_dict({
... "posterior_predictive": {"y": rng.normal(0, 1, (2, 500))}
... })
>>> idata2 = az.from_dict({
... "posterior_predictive": {"y": rng.normal(1, 1, (2, 500))}
... })
>>> result = model_divergence(idata1, idata2)
>>> result["y"] > 0
True
Convergence Diagnostics¶
chain_divergence(idata, *, var_names=None, method='energy', group='posterior')
¶
Compute pairwise distributional divergence between MCMC chains.
Complements R-hat with a full distributional comparison: R-hat checks whether chains have similar means and variances, while this function checks whether they sample from the same distribution (detecting differences in shape, skewness, and multimodality).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData with multiple chains in the specified group. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `method` | `str` | Divergence measure; see Notes for supported measures. Default is `'energy'`. | `'energy'` |
| `group` | `str` | Group to analyze. Default is `'posterior'`. | `'posterior'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ndarray]` | Maps parameter names to pairwise divergence matrices of shape `(n_chains, n_chains)`. |
Notes
For symmetric methods (energy, MMD, JS, Hellinger, TV, Wasserstein),
the matrix is symmetric. For KL divergence, entry (i, j) is
:math:`D_{\text{KL}}(\text{chain}_i \| \text{chain}_j)`.
Well-mixed chains should have small pairwise divergences. Large off-diagonal values indicate chains exploring different regions of parameter space.
For method="mmd" the RBF bandwidth is estimated once from the
pooled samples across all chains, not per pair. This gives every
entry of the returned matrix a shared scale, which is appropriate
for ranking pairwise divergences, and halves the wall time by
removing the per-pair median-heuristic call. Users who need
per-pair bandwidths can call
:func:`~divergence.ipms.maximum_mean_discrepancy` directly.
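The quantity in the returned matrix is easy to reproduce by hand in 1D. The sketch below builds the pairwise matrix with a naive O(n^2) energy-distance estimator; the helper name `energy_distance` is illustrative, and the library's estimator (including its 1D fast path) may differ.

```python
import numpy as np

def energy_distance(x, y):
    """Naive sample energy distance between two 1D samples."""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * xy - xx - yy

chains = np.random.default_rng(0).normal(0, 1, (4, 500))  # (chain, draw)
C = len(chains)
D = np.zeros((C, C))
for i in range(C):
    for j in range(i + 1, C):
        D[i, j] = D[j, i] = energy_distance(chains[i], chains[j])
# Well-mixed chains: all off-diagonal entries near zero.
```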
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_divergence(idata)
>>> result["mu"].shape
(4, 4)
chain_ksd(idata, score_fn, *, var_names=None, group='posterior', kernel='imq', bandwidth=None, split=True)
¶
Compute per-chain kernel Stein discrepancy against the target.
Unlike :func:`chain_divergence`, which tests whether chains agree
with each other, this function tests whether each chain has
converged to the correct target distribution. It is the only
diagnostic that can detect the situation where all chains have
converged to the wrong distribution.
The KSD requires only the score function
:math:`\nabla \log \pi(x)` of the target, which is available as a
byproduct of gradient-based MCMC (HMC, NUTS).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree or InferenceData` | ArviZ inference data containing the specified group. | required |
| `score_fn` | `callable` | Score function of the target distribution. Takes an array of draws of shape `(n, d)` and returns the scores evaluated row-wise. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `group` | `str` | InferenceData group. Default is `'posterior'`. | `'posterior'` |
| `kernel` | `str` | Kernel for the Stein operator. Default is `'imq'`. | `'imq'` |
| `bandwidth` | `float or None` | Kernel bandwidth / scale parameter. If `None`, a default is chosen automatically. | `None` |
| `split` | `bool` | If `True`, each chain is split in half before the KSD is computed. | `True` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ChainKSDResult]` | Maps parameter names to :class:`ChainKSDResult` objects with per-chain and pooled KSD values (e.g. `ksd_pooled`). |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If the specified group is missing from `idata`. |
Notes
For the IMQ kernel :math:`k(x,y) = (c^2 + \|x-y\|^2)^{-1/2}`,
Gorham and Mackey (2017) [1]_ proved that
:math:`\mathrm{KSD}(\mu_n, \pi) \to 0` implies
:math:`\mu_n \Rightarrow \pi` (weak convergence) and tightness of
:math:`\{\mu_n\}`. This is strictly stronger than what
:math:`\hat{R}` can guarantee.
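For analytically known targets the score function is usually a one-liner; gradient-based samplers can also reuse their own gradient. Two common cases, with illustrative names that are not part of the API:

```python
import numpy as np

# Standard normal target: grad log pi(x) = -x.
score_std_normal = lambda x: -x

# General Gaussian target N(mu, Sigma): grad log pi(x) = -Sigma^{-1} (x - mu),
# applied row-wise to an (n, d) array of draws.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def score_gauss(x):
    return -np.linalg.solve(Sigma, (x - mu).T).T
```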
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_ksd(idata, lambda x: -x)
>>> result["mu"].ksd_pooled < 0.1
True
References
.. [1] Gorham, J. & Mackey, L. (2017). "Measuring sample quality with kernels." ICML.
.. [2] Liu, Q., Lee, J., & Jordan, M. (2016). "A kernelized Stein discrepancy for goodness-of-fit tests." ICML.
chain_two_sample_test(idata, *, var_names=None, method='mmd', group='posterior', n_permutations=500, seed=None, low_memory=None)
¶
Pairwise two-sample permutation tests between MCMC chains.
Upgrades :func:`chain_divergence` from raw divergence magnitudes to
calibrated p-values. A small p-value for chains i and j
indicates that they are sampling from detectably different
distributions, suggesting a convergence failure.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree or InferenceData` | ArviZ inference data containing the specified group. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `method` | `str` | Test statistic: `'mmd'` (default) or `'energy'`; see Notes. | `'mmd'` |
| `group` | `str` | InferenceData group. Default is `'posterior'`. | `'posterior'` |
| `n_permutations` | `int` | Number of permutations for the null distribution. Default is 500. Higher values give more precise p-values at increased computational cost. | `500` |
| `seed` | `int or None` | Base random seed for reproducibility. Per-pair seeds are derived from it deterministically. | `None` |
| `low_memory` | `bool or None` | Memory strategy forwarded to the underlying permutation test. | `None` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ChainTestResult]` | Maps parameter names to :class:`ChainTestResult` objects with pairwise test statistics and p-values (e.g. `any_significant`). |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If the specified group is missing or `method` is invalid. |
Notes
For well-mixed chains, all pairwise p-values should be large (> 0.05). If any pair has a small p-value, the chains are sampling from detectably different distributions, indicating a convergence problem.
For typical MCMC use (scalar or low-dimensional parameters), the
"energy" method is generally faster: it has a 1D sort-based
fast path that runs in microseconds even for thousands of draws,
and it does not require kernel-bandwidth selection. The "mmd"
method is appropriate when you have a specific kernel choice in
mind, are working with high-dimensional posteriors, or want to
align with a published benchmark that specifies MMD.
For method="mmd" the RBF bandwidth is estimated once from the
pooled samples across all chains and reused for every pair,
avoiding C*(C-1)/2 redundant median-heuristic calls.
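The calibration logic is a standard permutation test. Below is a minimal sketch for one pair of chains using a naive 1D energy statistic; the helper names `energy_stat` and `permutation_pvalue` are illustrative, and the library's implementation vectorizes this and uses its fast paths.

```python
import numpy as np

def energy_stat(x, y):
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * xy - xx - yy

def permutation_pvalue(x, y, n_permutations=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = energy_stat(x, y)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the pooled draws under the null
        hits += energy_stat(pooled[:len(x)], pooled[len(x):]) >= observed
    # Add-one correction keeps the p-value valid under the null.
    return (hits + 1) / (n_permutations + 1)

rng = np.random.default_rng(42)
p = permutation_pvalue(rng.normal(0, 1, 300), rng.normal(0, 1, 300))
```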
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_two_sample_test(idata, n_permutations=200, seed=42)
>>> result["mu"].any_significant
False
References
.. [1] Gretton, A. et al. (2012). "A kernel two-sample test." JMLR, 13, 723-773.
.. [2] Szekely, G. J. & Rizzo, M. L. (2004). "Testing for equal distributions in high dimension." InterStat.
mixing_diagnostic(idata, *, var_names=None, group='posterior', lag=1, knn_k=5)
¶
Diagnose chain mixing using transfer entropy.
Applies :func:`~divergence.transfer_entropy` to detect two types
of mixing failure:

- Non-stationarity within chains: if the first half of a chain predicts the second half (transfer entropy > 0), the chain has not yet reached its stationary distribution.
- Spurious dependence between chains: if one chain's trace predicts another's (transfer entropy > 0), the chains are not truly independent, indicating shared non-stationarity or coupling artifacts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree or InferenceData` | ArviZ inference data containing the specified group. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `group` | `str` | InferenceData group. Default is `'posterior'`. | `'posterior'` |
| `lag` | `int` | Embedding dimension for the target series in the transfer entropy computation. Default is 1. | `1` |
| `knn_k` | `int` | Number of nearest neighbors for the kNN entropy estimator used internally by transfer entropy. Default is 5. | `5` |
Returns:

| Type | Description |
|---|---|
| `dict[str, MixingDiagnostic]` | Maps parameter names to :class:`MixingDiagnostic` results; `stationarity_te` holds one within-chain transfer entropy per chain. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If the group is missing or chains are too short for the requested embedding dimensions. |
Notes
For vector parameters (shape (chain, draw, K)), transfer
entropy is computed for each of the K components separately and
the results are averaged. This avoids the curse of dimensionality
in kNN entropy estimation.
Transfer entropy is related to Granger causality: for jointly Gaussian processes the two are equivalent [2]_. The key advantage is that TE is fully nonparametric — it detects any form of directed statistical dependence.
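For intuition about the stationarity check, the Gaussian special case can be computed with two regressions via the Granger equivalence [2]_. The sketch below is illustrative only (the helper `gaussian_te` is not part of the API); `mixing_diagnostic` uses the nonparametric kNN transfer entropy estimator instead.

```python
import numpy as np

def gaussian_te(source, target, lag=1):
    """TE(source -> target) for jointly Gaussian series:
    0.5 * log(resid var without source / resid var with source)."""
    y, y_past, x_past = target[lag:], target[:-lag], source[:-lag]

    def resid_var(*regressors):
        X = np.column_stack([np.ones_like(y), *regressors])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.var(y - X @ beta)

    return 0.5 * np.log(resid_var(y_past) / resid_var(y_past, x_past))

chain = np.random.default_rng(0).normal(size=1000)
first, second = chain[:500], chain[500:]
# Near zero for a stationary, well-mixed chain: the first half adds
# no predictive information about the second half.
gaussian_te(first, second)
```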
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = mixing_diagnostic(idata)
>>> result["mu"].stationarity_te.shape
(4,)
References
.. [1] Schreiber, T. (2000). "Measuring information transfer." Physical Review Letters, 85(2), 461-464.
.. [2] Barnett, L., Barrett, A. B., & Seth, A. K. (2009). "Granger causality and transfer entropy are equivalent for Gaussian variables." Physical Review Letters, 103(23), 238701.