Bayesian Diagnostics¶
ArviZ integration for MCMC convergence assessment and Bayesian inference diagnostics. All functions accept ArviZ InferenceData or xarray.DataTree objects.
Inference Diagnostics¶
information_gain(idata, *, var_names=None, method='kl', multivariate=False)
¶
Compute how much the data updated beliefs about each parameter.
Measures D(posterior || prior) per parameter using the specified divergence. This is the information-theoretic analogue of asking "how informative was the data?"
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData (or DataTree) with `posterior` and `prior` groups. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all variables present in both groups are analyzed. | `None` |
| `method` | `str` | Divergence measure. Default is `'kl'`. | `'kl'` |
| `multivariate` | `bool` | If `True`, compute a single joint divergence across all dimensions of a parameter instead of one value per dimension. | `False` |
Returns:

| Type | Description |
|---|---|
| `dict[str, float \| ndarray]` | Maps parameter names to divergence values. Scalar parameters produce floats; vector-valued parameters produce arrays. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If required groups are missing or `method` is invalid. |
| `KeyError` | If a requested variable is not found in both groups. |
Notes
The KL divergence :math:`D_{\text{KL}}(\text{posterior} \| \text{prior})`
quantifies the information gained from the data, in nats. A value
near 0 means the data was uninformative for that parameter; a large
value means the posterior differs substantially from the prior.
The kNN-based KL estimator is the default because it is fast, requires no density estimation, and works in any dimension.
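The estimator is easy to sketch. Below is a minimal kNN KL estimator in the Wang-Kulkarni-Verdú style, assuming samples arranged as `(n, d)` arrays; the helper name `knn_kl` is illustrative, and the library's implementation may differ in details such as bias correction and tie handling.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl(x, y, k=5):
    """kNN estimate of D_KL(P || Q) from x ~ P (n, d) and y ~ Q (m, d)."""
    n, d = x.shape
    m = y.shape[0]
    # Distance from each x_i to its k-th neighbour within x
    # (query k+1 points: the nearest is x_i itself) ...
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # ... and to its k-th neighbour within y.
    nu = cKDTree(y).query(x, k=k)[0][:, -1]
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1))

rng = np.random.default_rng(0)
posterior = rng.normal(1.0, 0.5, (2000, 1))
prior = rng.normal(0.0, 1.0, (2000, 1))
knn_kl(posterior, prior)  # close to the analytic KL of ~0.82 nats
```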
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(1, 0.5, (2, 500))},
... "prior": {"mu": rng.normal(0, 1, (2, 500))},
... })
>>> result = information_gain(idata)
>>> result["mu"] > 0
True
References
.. [1] Kullback, S. & Leibler, R. A. (1951). "On Information and Sufficiency." Annals of Mathematical Statistics, 22(1), 79-86.
.. [2] Lindley, D. V. (1956). "On a measure of the information provided by an experiment." Annals of Mathematical Statistics, 27(4), 986-1005.
prior_sensitivity(idata, reference_prior_samples, *, var_names=None, method='kl')
¶
Quantify how sensitive posteriors are to the choice of prior.
Compares information gain under the actual prior versus a reference prior. High sensitivity means conclusions depend on the prior choice.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData with `posterior` and `prior` groups. | required |
| `reference_prior_samples` | `dict[str, ndarray]` | Maps parameter names to 1D numpy arrays of samples from a reference (e.g. vague/uninformative) prior. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all variables present in both groups are analyzed. | `None` |
| `method` | `str` | Divergence measure. Default is `'kl'`. | `'kl'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, float]]` | Maps parameter names to dicts of summary values, including a `"sensitivity"` key. |
Notes
If sensitivity is high for a parameter, this suggests the data is not very informative for that parameter and conclusions depend on the prior. If sensitivity is low, the data dominates and the prior choice matters little.
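A minimal sketch of the comparison this function formalizes: measure information gain under the actual prior, then again with the reference prior swapped in, and summarize the gap. The absolute-difference summary below is illustrative only; `prior_sensitivity` may report a different summary.

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
posterior = rng.normal(1, 0.5, (2, 500))

# Same posterior paired with the actual prior vs. a vague reference prior.
idata_actual = az.from_dict({
    "posterior": {"mu": posterior},
    "prior": {"mu": rng.normal(0, 1, (2, 500))},
})
idata_reference = az.from_dict({
    "posterior": {"mu": posterior},
    "prior": {"mu": rng.normal(0, 10, (2, 500))},  # reference prior
})

ig_actual = information_gain(idata_actual)["mu"]
ig_reference = information_gain(idata_reference)["mu"]

# Illustrative summary: how much the measured information gain moves
# when the actual prior is replaced by the reference prior.
sensitivity = abs(ig_actual - ig_reference)
```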
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(1, 0.5, (2, 500))},
... "prior": {"mu": rng.normal(0, 1, (2, 500))},
... })
>>> ref = {"mu": rng.normal(0, 10, 5000)}
>>> result = prior_sensitivity(idata, ref)
>>> "sensitivity" in result["mu"]
True
bayesian_surprise(idata, *, log_likelihood_group='log_likelihood', var_name=None)
¶
Compute per-observation surprise (self-information).
For each observation i, the surprise is:
.. math:: S(y_i) = -\log \mathbb{E}_\theta[p(y_i \mid \theta)]
approximated via the log-sum-exp trick over MCMC draws.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData. Must contain the log-likelihood group. | required |
| `log_likelihood_group` | `str` | Name of the log-likelihood group. Default is `'log_likelihood'`. | `'log_likelihood'` |
| `var_name` | `str` | Variable name within the group. If `None`, all variables in the group are used. | `None` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ndarray]` | Maps variable names to arrays of per-observation surprise values. Higher values indicate more surprising (influential or outlying) observations. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the log-likelihood group is missing or `var_name` cannot be resolved. |
Notes
Observations with high surprise are:
- Influential data points driving the posterior
- Potential outliers the model struggles to explain
- Regions of model misspecification
This is closely related to the pointwise WAIC/LOO computation but expressed in the information-theoretic framework.
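The log-sum-exp computation from the summary above can be sketched directly in numpy; `bayesian_surprise` performs the equivalent reduction over the `(chain, draw)` dimensions.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(42)
log_lik = rng.standard_normal((2, 500, 20))   # (chain, draw, obs)

# Pool chains and draws: S values of log p(y_i | theta_s) per observation.
ll = log_lik.reshape(-1, log_lik.shape[-1])
S = ll.shape[0]

# S(y_i) = -log( (1/S) * sum_s p(y_i | theta_s) ), computed stably.
surprise = np.log(S) - logsumexp(ll, axis=0)
surprise.shape  # (20,)
```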
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "log_likelihood": {"y": rng.standard_normal((2, 500, 20))}
... })
>>> result = bayesian_surprise(idata)
>>> result["y"].shape
(20,)
References
.. [1] Itti, L. & Baldi, P. (2009). "Bayesian surprise attracts human attention." Vision Research, 49(10), 1295-1306.
uncertainty_decomposition(idata, *, var_names=None, method='knn', group='posterior_predictive')
¶
Decompose predictive uncertainty into aleatoric and epistemic.
Uses the information-theoretic decomposition:
.. math:: \underbrace{H[p(y^* \mid y)]}_{\text{total}} = \underbrace{\mathbb{E}_\theta[H[p(y^* \mid \theta)]]}_{\text{aleatoric}} + \underbrace{I(y^*; \theta \mid y)}_{\text{epistemic}}
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData with the posterior predictive group. | required |
| `var_names` | `list of str` | Predicted variable names. If `None`, all variables in the group are analyzed. | `None` |
| `method` | `str` | Entropy estimation method. Default is `'knn'`. | `'knn'` |
| `group` | `str` | Predictive group name. Default is `'posterior_predictive'`. | `'posterior_predictive'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, dict[str, float]]` | Maps variable names to dicts with `"total"`, `"aleatoric"`, and `"epistemic"` entries. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If the predictive group is missing or the observation dimension is too small (< 10) for reliable entropy estimation. |
Notes
Total entropy is estimated from all posterior predictive samples pooled across chains and draws.
Aleatoric entropy is estimated per MCMC draw: each
(chain, draw) pair corresponds to a fixed parameter value
theta, and the observation dimension provides the samples from
p(y* | theta). The aleatoric estimate is the mean of these
per-draw entropies.
This requires the posterior predictive to have an observation
dimension (e.g., shape (chain, draw, obs)). If there is only
one predicted value per draw, per-draw entropy cannot be estimated.
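For intuition, the decomposition can be verified on a toy model where the per-draw predictive is Gaussian with known noise, so the per-draw entropy has a closed form. This sketch uses Gaussian entropies throughout for illustration; `uncertainty_decomposition` instead uses the nonparametric estimator selected by `method`.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                                  # aleatoric (observation) noise
theta = rng.normal(0.0, 0.5, 1000)           # posterior draws of the mean

# Posterior predictive samples, shape (draw, obs).
y_pred = rng.normal(theta[:, None], sigma, (1000, 50))

def gauss_entropy(s):
    return 0.5 * np.log(2 * np.pi * np.e * s**2)

aleatoric = gauss_entropy(sigma)             # E_theta[ H[p(y* | theta)] ]
total = gauss_entropy(y_pred.std())          # entropy of the pooled predictive
epistemic = total - aleatoric                # I(y*; theta | y), >= 0
```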
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior_predictive": {"y": rng.normal(0, 1, (2, 500, 50))}
... })
>>> result = uncertainty_decomposition(idata)
>>> result["y"]["epistemic"] >= -0.1
True
References
.. [1] Depeweg, S. et al. (2018). "Decomposition of Uncertainty in Bayesian Deep Learning for Efficient and Risk-sensitive Learning." ICML.
model_divergence(idata1, idata2, *, var_names=None, method='energy', group='posterior_predictive')
¶
Compare predictive distributions from two models.
Computes the divergence between corresponding variables in the specified group of two InferenceData objects.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata1` | `DataTree` | First ArviZ InferenceData object. Must contain `group`. | required |
| `idata2` | `DataTree` | Second ArviZ InferenceData object. Must contain `group`. | required |
| `var_names` | `list of str` | Variables to compare. If `None`, all variables shared by both objects are compared. | `None` |
| `method` | `str` | Divergence measure. Default is `'energy'`. | `'energy'` |
| `group` | `str` | Group to compare. Default is `'posterior_predictive'`. | `'posterior_predictive'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, float]` | Maps variable names to divergence values. |
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata1 = az.from_dict({
... "posterior_predictive": {"y": rng.normal(0, 1, (2, 500))}
... })
>>> idata2 = az.from_dict({
... "posterior_predictive": {"y": rng.normal(1, 1, (2, 500))}
... })
>>> result = model_divergence(idata1, idata2)
>>> result["y"] > 0
True
Convergence Diagnostics¶
chain_divergence(idata, *, var_names=None, method='energy', group='posterior')
¶
Compute pairwise distributional divergence between MCMC chains.
Complements R-hat with a full distributional comparison: R-hat checks whether chains have similar means and variances, while this function checks whether they sample from the same distribution (detecting differences in shape, skewness, and multimodality).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree` | ArviZ InferenceData with multiple chains in the specified group. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `method` | `str` | Divergence measure; see Notes for supported measures. Default is `'energy'`. | `'energy'` |
| `group` | `str` | Group to analyze. Default is `'posterior'`. | `'posterior'` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ndarray]` | Maps parameter names to pairwise divergence matrices of shape `(n_chains, n_chains)`. |
Notes
For symmetric methods (energy, MMD, JS, Hellinger, TV, Wasserstein),
the matrix is symmetric. For KL divergence, entry (i, j) is
:math:`D_{\text{KL}}(\text{chain}_i \| \text{chain}_j)`.
Well-mixed chains should have small pairwise divergences. Large off-diagonal values indicate chains exploring different regions of parameter space.
For method="mmd" the RBF bandwidth is estimated once from the
pooled samples across all chains, not per pair. This gives every
entry of the returned matrix a shared scale, which is appropriate
for ranking pairwise divergences, and halves the wall time by
removing the per-pair median-heuristic call. Users who need
per-pair bandwidths can call
:func:`~divergence.ipms.maximum_mean_discrepancy` directly.
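The quantity in the returned matrix is easy to reproduce by hand in 1D. The sketch below builds the pairwise matrix with a naive O(n^2) energy-distance estimator; the helper name `energy_distance` is illustrative, and the library's estimator (including its 1D fast path) may differ.

```python
import numpy as np

def energy_distance(x, y):
    """Naive sample energy distance between two 1D samples."""
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * xy - xx - yy

chains = np.random.default_rng(0).normal(0, 1, (4, 500))  # (chain, draw)
C = len(chains)
D = np.zeros((C, C))
for i in range(C):
    for j in range(i + 1, C):
        D[i, j] = D[j, i] = energy_distance(chains[i], chains[j])
# Well-mixed chains: all off-diagonal entries near zero.
```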
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_divergence(idata)
>>> result["mu"].shape
(4, 4)
chain_ksd(idata, score_fn, *, var_names=None, group='posterior', kernel='imq', bandwidth=None, split=True)
¶
Compute per-chain kernel Stein discrepancy against the target.
Unlike :func:`chain_divergence`, which tests whether chains agree
with each other, this function tests whether each chain has
converged to the correct target distribution. It is the only
diagnostic that can detect the situation where all chains have
converged to the wrong distribution.
The KSD requires only the score function
:math:`\nabla \log \pi(x)` of the target, which is available as a
byproduct of gradient-based MCMC (HMC, NUTS).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree or InferenceData` | ArviZ inference data containing the specified group. | required |
| `score_fn` | `callable` | Score function of the target distribution. Takes an array of draws of shape `(n, d)` and returns the scores evaluated row-wise. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `group` | `str` | InferenceData group. Default is `'posterior'`. | `'posterior'` |
| `kernel` | `str` | Kernel for the Stein operator. Default is `'imq'`. | `'imq'` |
| `bandwidth` | `float or None` | Kernel bandwidth / scale parameter. If `None`, a default is chosen automatically. | `None` |
| `split` | `bool` | If `True`, each chain is split in half before the KSD is computed. | `True` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ChainKSDResult]` | Maps parameter names to :class:`ChainKSDResult` objects with per-chain and pooled KSD values (e.g. `ksd_pooled`). |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If the specified group is missing from `idata`. |
Notes
For the IMQ kernel :math:`k(x,y) = (c^2 + \|x-y\|^2)^{-1/2}`,
Gorham and Mackey (2017) [1]_ proved that
:math:`\mathrm{KSD}(\mu_n, \pi) \to 0` implies
:math:`\mu_n \Rightarrow \pi` (weak convergence) and tightness of
:math:`\{\mu_n\}`. This is strictly stronger than what
:math:`\hat{R}` can guarantee.
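For analytically known targets the score function is usually a one-liner; gradient-based samplers can also reuse their own gradient. Two common cases, with illustrative names that are not part of the API:

```python
import numpy as np

# Standard normal target: grad log pi(x) = -x.
score_std_normal = lambda x: -x

# General Gaussian target N(mu, Sigma): grad log pi(x) = -Sigma^{-1} (x - mu),
# applied row-wise to an (n, d) array of draws.
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

def score_gauss(x):
    return -np.linalg.solve(Sigma, (x - mu).T).T
```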
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_ksd(idata, lambda x: -x)
>>> result["mu"].ksd_pooled < 0.1
True
References
.. [1] Gorham, J. & Mackey, L. (2017). "Measuring sample quality with kernels." ICML.
.. [2] Liu, Q., Lee, J., & Jordan, M. (2016). "A kernelized Stein discrepancy for goodness-of-fit tests." ICML.
chain_two_sample_test(idata, *, var_names=None, method='mmd', group='posterior', n_permutations=500, seed=None, low_memory=None)
¶
Pairwise two-sample permutation tests between MCMC chains.
Upgrades :func:`chain_divergence` from raw divergence magnitudes to
calibrated p-values. A small p-value for chains i and j
indicates that they are sampling from detectably different
distributions, suggesting a convergence failure.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree or InferenceData` | ArviZ inference data containing the specified group. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `method` | `str` | Test statistic: `'mmd'` (default) or `'energy'`; see Notes. | `'mmd'` |
| `group` | `str` | InferenceData group. Default is `'posterior'`. | `'posterior'` |
| `n_permutations` | `int` | Number of permutations for the null distribution. Default is 500. Higher values give more precise p-values at increased computational cost. | `500` |
| `seed` | `int or None` | Base random seed for reproducibility. Per-pair seeds are derived from it deterministically. | `None` |
| `low_memory` | `bool or None` | Memory strategy forwarded to the underlying permutation test. | `None` |
Returns:

| Type | Description |
|---|---|
| `dict[str, ChainTestResult]` | Maps parameter names to :class:`ChainTestResult` objects with pairwise test statistics and p-values (e.g. `any_significant`). |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If the specified group is missing or `method` is invalid. |
Notes
For well-mixed chains, all pairwise p-values should be large (> 0.05). If any pair has a small p-value, the chains are sampling from detectably different distributions, indicating a convergence problem.
For typical MCMC use (scalar or low-dimensional parameters), the
"energy" method is generally faster: it has a 1D sort-based
fast path that runs in microseconds even for thousands of draws,
and it does not require kernel-bandwidth selection. The "mmd"
method is appropriate when you have a specific kernel choice in
mind, are working with high-dimensional posteriors, or want to
align with a published benchmark that specifies MMD.
For method="mmd" the RBF bandwidth is estimated once from the
pooled samples across all chains and reused for every pair,
avoiding C*(C-1)/2 redundant median-heuristic calls.
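The calibration logic is a standard permutation test. Below is a minimal sketch for one pair of chains using a naive 1D energy statistic; the helper names `energy_stat` and `permutation_pvalue` are illustrative, and the library's implementation vectorizes this and uses its fast paths.

```python
import numpy as np

def energy_stat(x, y):
    xy = np.abs(x[:, None] - y[None, :]).mean()
    xx = np.abs(x[:, None] - x[None, :]).mean()
    yy = np.abs(y[:, None] - y[None, :]).mean()
    return 2 * xy - xx - yy

def permutation_pvalue(x, y, n_permutations=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = energy_stat(x, y)
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # relabel the pooled draws under the null
        hits += energy_stat(pooled[:len(x)], pooled[len(x):]) >= observed
    # Add-one correction keeps the p-value valid under the null.
    return (hits + 1) / (n_permutations + 1)

rng = np.random.default_rng(42)
p = permutation_pvalue(rng.normal(0, 1, 300), rng.normal(0, 1, 300))
```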
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = chain_two_sample_test(idata, n_permutations=200, seed=42)
>>> result["mu"].any_significant
False
References
.. [1] Gretton, A. et al. (2012). "A kernel two-sample test." JMLR, 13, 723-773.
.. [2] Szekely, G. J. & Rizzo, M. L. (2004). "Testing for equal distributions in high dimension." InterStat.
mixing_diagnostic(idata, *, var_names=None, group='posterior', lag=1, knn_k=5)
¶
Diagnose chain mixing using transfer entropy.
Applies :func:`~divergence.transfer_entropy` to detect two types
of mixing failure:

- Non-stationarity within chains: if the first half of a chain predicts the second half (transfer entropy > 0), the chain has not yet reached its stationary distribution.
- Spurious dependence between chains: if one chain's trace predicts another's (transfer entropy > 0), the chains are not truly independent, indicating shared non-stationarity or coupling artifacts.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `idata` | `DataTree or InferenceData` | ArviZ inference data containing the specified group. | required |
| `var_names` | `list of str` | Parameters to analyze. If `None`, all parameters in the group are analyzed. | `None` |
| `group` | `str` | InferenceData group. Default is `'posterior'`. | `'posterior'` |
| `lag` | `int` | Embedding dimension for the target series in the transfer entropy computation. Default is 1. | `1` |
| `knn_k` | `int` | Number of nearest neighbors for the kNN entropy estimator used internally by transfer entropy. Default is 5. | `5` |
Returns:

| Type | Description |
|---|---|
| `dict[str, MixingDiagnostic]` | Maps parameter names to :class:`MixingDiagnostic` results; `stationarity_te` holds one within-chain transfer entropy per chain. |
Raises:

| Type | Description |
|---|---|
| `ImportError` | If ArviZ is not installed. |
| `ValueError` | If the group is missing or chains are too short for the requested embedding dimensions. |
Notes
For vector parameters (shape (chain, draw, K)), transfer
entropy is computed for each of the K components separately and
the results are averaged. This avoids the curse of dimensionality
in kNN entropy estimation.
Transfer entropy is related to Granger causality: for jointly Gaussian processes the two are equivalent [2]_. The key advantage is that TE is fully nonparametric — it detects any form of directed statistical dependence.
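For intuition about the stationarity check, the Gaussian special case can be computed with two regressions via the Granger equivalence [2]_. The sketch below is illustrative only (the helper `gaussian_te` is not part of the API); `mixing_diagnostic` uses the nonparametric kNN transfer entropy estimator instead.

```python
import numpy as np

def gaussian_te(source, target, lag=1):
    """TE(source -> target) for jointly Gaussian series:
    0.5 * log(resid var without source / resid var with source)."""
    y, y_past, x_past = target[lag:], target[:-lag], source[:-lag]

    def resid_var(*regressors):
        X = np.column_stack([np.ones_like(y), *regressors])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return np.var(y - X @ beta)

    return 0.5 * np.log(resid_var(y_past) / resid_var(y_past, x_past))

chain = np.random.default_rng(0).normal(size=1000)
first, second = chain[:500], chain[500:]
# Near zero for a stationary, well-mixed chain: the first half adds
# no predictive information about the second half.
gaussian_te(first, second)
```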
Examples:
>>> import numpy as np
>>> import arviz as az
>>> rng = np.random.default_rng(42)
>>> idata = az.from_dict({
... "posterior": {"mu": rng.normal(0, 1, (4, 500))}
... })
>>> result = mixing_diagnostic(idata)
>>> result["mu"].stationarity_te.shape
(4,)
References
.. [1] Schreiber, T. (2000). "Measuring information transfer." Physical Review Letters, 85(2), 461-464.
.. [2] Barnett, L., Barrett, A. B., & Seth, A. K. (2009). "Granger causality and transfer entropy are equivalent for Gaussian variables." Physical Review Letters, 103(23), 238701.