On Optimal Selection of Summary Statistics for Approximate Bayesian Computation
Statistical Applications in Genetics and Molecular Biology

How best to summarize large and complex datasets is a problem that arises in many areas of science. We approach it by seeking data summaries that minimize the average squared error of the posterior distribution for a parameter of interest under approximate Bayesian computation (ABC). In ABC, simulation under the model replaces computation of the likelihood, which is convenient for many complex models. Simulated and observed datasets are usually compared using summary statistics, which in practice are typically chosen on the basis of the investigator's intuition and established practice in the field. We propose two algorithms for automated choice of efficient data summaries. First, we motivate minimization of the estimated entropy of the posterior approximation as a heuristic for the selection of summary statistics. Second, we propose a two-stage procedure: the minimum-entropy algorithm is used to identify simulated datasets close to the observed dataset, and each of these is in turn treated as an observed dataset for which the mean root integrated squared error of the ABC posterior approximation is minimized over sets of summary statistics. In a simulation study, we inferred the scaled mutation and recombination parameters, both singly and jointly, from a population sample of DNA sequences. The computationally fast minimum-entropy algorithm showed a modest improvement over existing methods, while our two-stage procedure showed substantial and highly significant further improvement for both univariate and bivariate inferences. We found that the optimal set of summary statistics was highly dataset-specific, suggesting that more generally there may be no globally optimal choice, which argues for a new selection for each dataset even if the model and target of inference are unchanged.
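The minimum-entropy idea can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: a toy Gaussian location model stands in for the population-genetic simulator, the candidate statistics are the sample mean, variance, and range, acceptance keeps the closest 1% of simulations, and entropy is estimated with a simple k-th nearest-neighbour estimator. The helper names (`knn_entropy`, `summaries`, `simulate`, `select_min_entropy`) are hypothetical.

```python
import itertools
import numpy as np


def knn_entropy(x, k=4):
    """k-th nearest-neighbour entropy estimate (in nats) for a 1-D sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    # Brute-force pairwise distances; column 0 after sorting is the point itself.
    dists = np.abs(x[:, None] - x[None, :])
    r_k = np.sort(dists, axis=1)[:, k]
    # digamma(k) for integer k: -gamma + sum_{j<k} 1/j
    psi_k = -np.euler_gamma + sum(1.0 / j for j in range(1, k))
    # log(2) is the log-volume of the unit ball in one dimension.
    return np.mean(np.log(r_k)) + np.log(2.0) - psi_k + np.log(n)


def summaries(data):
    """Candidate statistics per dataset (rows): mean, variance, range."""
    return np.column_stack([
        data.mean(axis=1),
        data.var(axis=1, ddof=1),
        data.max(axis=1) - data.min(axis=1),
    ])


def simulate(theta, rng, n_obs=50):
    """Toy model (assumption): n_obs draws from Normal(theta, 1) per theta."""
    return rng.normal(theta[:, None], 1.0, size=(theta.size, n_obs))


def select_min_entropy(obs_stats, sim_stats, theta, accept_frac=0.01):
    """Rejection ABC for every subset of statistics; keep the subset whose
    accepted parameter sample has the smallest estimated entropy."""
    scale = sim_stats.std(axis=0)  # put statistics on a common scale
    n_keep = int(accept_frac * theta.size)
    n_stats = sim_stats.shape[1]
    best = None
    for size in range(1, n_stats + 1):
        for subset in itertools.combinations(range(n_stats), size):
            cols = list(subset)
            diff = (sim_stats[:, cols] - obs_stats[cols]) / scale[cols]
            dist = np.sqrt((diff ** 2).sum(axis=1))
            accepted = theta[np.argsort(dist)[:n_keep]]
            h = knn_entropy(accepted)
            if best is None or h < best[1]:
                best = (subset, h)
    return best


rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 10.0, size=20000)      # draws from a uniform prior
sim_stats = summaries(simulate(theta, rng))     # statistics of simulated data
obs_stats = summaries(simulate(np.array([4.0]), rng))[0]  # "observed" data
best_subset, best_entropy = select_min_entropy(obs_stats, sim_stats, theta)
print(best_subset, round(best_entropy, 3))
```

In this toy setting only the sample mean carries information about the location parameter, so a subset containing it should be preferred. The exhaustive search over subsets grows exponentially with the number of candidate statistics, so this form is only practical for small candidate sets.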
