Bayesian posterior predictive p-value of statistical consistency in interlaboratory evaluations

The results from an interlaboratory evaluation are said to be statistically consistent if they fit a normal (Gaussian) consistency model, which postulates that the results have the same unknown expected value and the stated variances and covariances. A modern method for checking the fit of a statistical model to data is posterior predictive checking, a Bayesian adaptation of classical hypothesis testing. In this paper we propose using posterior predictive checking to check the fit of the normal consistency model to interlaboratory results. If the model fits reasonably well, the results may be regarded as statistically consistent. The principle of posterior predictive checking is that the realized results should look plausible under a posterior predictive distribution. The posterior predictive distribution is the conditional distribution, given the realized results, of the potential results that could be obtained in contemplated replications of the interlaboratory evaluation under the statistical model. A systematic discrepancy between the realized results and potential results drawn from the posterior predictive distribution indicates a potential failing of the model. One can investigate any number of potential discrepancies between the model and the results. We discuss an overall measure of discrepancy for checking the consistency of a set of interlaboratory results. We also discuss two further sets of discrepancy measures, unilateral and bilateral. A unilateral discrepancy measure checks whether the result of a particular laboratory agrees with the statistical consistency model. A bilateral discrepancy measure checks whether the results of a particular pair of laboratories agree with each other. The degree of agreement is quantified by the Bayesian posterior predictive p-value. The unilateral and bilateral measures of discrepancy and their posterior predictive p-values discussed in this paper apply to both correlated and independent interlaboratory results.
We suggest that the posterior predictive p-values may be used to assess unilateral and bilateral degrees of agreement in International Committee for Weights and Measures (CIPM) key comparisons.
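
As an illustration of the idea, the following minimal Python sketch estimates an overall posterior predictive p-value by Monte Carlo simulation. It is not the procedure of the paper, whose discrepancy measures are not specified in this abstract. The sketch rests on several assumptions: the results are independent, the common expected value has a flat (noninformative) prior, and the overall discrepancy is the weighted sum of squared deviations from the weighted mean. The function name `posterior_predictive_p_value` and its parameters are hypothetical.

```python
import numpy as np


def posterior_predictive_p_value(x, u, n_draws=20000, seed=0):
    """Monte Carlo estimate of an overall posterior predictive p-value.

    Assumptions (not taken from the paper): independent results
    x[i] ~ N(mu, u[i]**2) with stated standard uncertainties u[i],
    and a flat prior on the common expected value mu.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    u = np.asarray(u, dtype=float)
    w = 1.0 / u**2  # weights from the stated variances

    def discrepancy(values):
        # Weighted sum of squared deviations from the weighted mean.
        wm = (values * w).sum(axis=-1, keepdims=True) / w.sum()
        return (w * (values - wm) ** 2).sum(axis=-1)

    # Under a flat prior, the posterior of mu is normal, centred on the
    # weighted mean, with variance 1 / sum(w).
    post_mean = (w * x).sum() / w.sum()
    post_sd = 1.0 / np.sqrt(w.sum())
    mu = rng.normal(post_mean, post_sd, size=n_draws)

    # Potential (replicated) results: one row per posterior draw of mu.
    x_rep = rng.normal(mu[:, None], u, size=(n_draws, x.size))

    # Fraction of replications at least as discrepant as the realized results.
    return float(np.mean(discrepancy(x_rep) >= discrepancy(x)))


if __name__ == "__main__":
    print(posterior_predictive_p_value([10.0, 10.1, 9.9, 10.05], [0.1] * 4))
    print(posterior_predictive_p_value([10.0, 10.1, 9.9, 11.0], [0.1] * 4))
```

A small p-value indicates that the realized results are more discrepant than most replications generated under the consistency model. Under these particular assumptions the chosen discrepancy does not depend on the expected value, so the estimate should agree, up to Monte Carlo error, with the classical chi-square test on n - 1 degrees of freedom. Other discrepancy measures and correlated results would require a different simulation.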