Approximately Sufficient Statistics and Bayesian Computation

Statistical Applications in Genetics and Molecular Biology

The analysis of high-dimensional data sets is often forced to rely upon well-chosen summary statistics. A systematic approach to choosing such statistics, which is based upon a sound theoretical framework, is currently lacking. In this paper we develop a sequential scheme for scoring statistics according to whether their inclusion in the analysis will substantially improve the quality of inference. Our method can be applied to high-dimensional data sets for which exact likelihood equations are not possible. We illustrate the potential of our approach with a series of examples drawn from genetics. In summary, in a context in which well-chosen summary statistics are of high importance, we attempt to put the 'well' into 'chosen.'
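The abstract does not spell out the scoring rule, so the following is only a rough sketch of what scoring a candidate statistic by its effect on inference might look like, not the paper's procedure. Everything in it is an assumption made for illustration: the toy normal model, the uniform prior, the use of rejection-based approximate Bayesian computation, the tolerance, and the hypothetical helpers `abc_rejection` and `inclusion_score`. The score here is simply the largest change in binned posterior mass when a second statistic (the median) is added to the first (the mean).

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy model (assumption): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed, stats, tol, n_draws, rng):
    """Keep prior draws whose simulated summaries all lie within tol
    of the corresponding observed summaries."""
    obs = [s(observed) for s in stats]
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # assumed Uniform(-5, 5) prior
        sim = simulate(theta, len(observed), rng)
        if all(abs(s(sim) - o) <= tol for s, o in zip(stats, obs)):
            accepted.append(theta)
    return accepted


def histogram(samples, bins, lo, hi):
    """Proportion of samples falling in each of `bins` equal-width bins."""
    if not samples:
        raise ValueError("no accepted samples; loosen the tolerance")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return [c / len(samples) for c in counts]


def inclusion_score(without, with_stat, bins=10, lo=-5.0, hi=5.0):
    """Hypothetical score: largest absolute change in binned posterior
    mass caused by adding the candidate statistic."""
    p = histogram(without, bins, lo, hi)
    q = histogram(with_stat, bins, lo, hi)
    return max(abs(a - b) for a, b in zip(p, q))


rng = random.Random(0)
observed = simulate(1.0, 50, rng)

# Posterior sample using the mean alone, then the mean plus the median.
post_mean = abc_rejection(observed, [statistics.mean], 0.2, 5000, rng)
post_both = abc_rejection(
    observed, [statistics.mean, statistics.median], 0.2, 5000, rng
)

score = inclusion_score(post_mean, post_both)
print(f"accepted: {len(post_mean)} vs {len(post_both)}, score: {score:.3f}")
```

For a normal model the sample mean is already sufficient for the location parameter, so one would expect the median to change the approximate posterior only slightly; a small score would then argue against including it. How large a score must be before a statistic is worth adding is a separate choice that this sketch does not address.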
