ABC random forests for Bayesian parameter inference

MOTIVATION: Approximate Bayesian computation (ABC) has grown into a standard methodology for Bayesian inference on models whose likelihood functions are intractable. Most ABC implementations require the preliminary selection of a vector of informative statistics summarizing the raw data. Furthermore, in almost all existing implementations, the tolerance level that separates acceptance from rejection of simulated parameter values needs to be calibrated.

RESULTS: We propose to conduct likelihood-free Bayesian inference about parameters without any prior selection of the relevant components of the summary statistics and without deriving an associated tolerance level. The approach relies on the random forest (RF) methodology of Breiman (2001) applied in a (non-parametric) regression setting. We advocate the derivation of a new RF for each component of the parameter vector of interest. Compared with earlier ABC solutions, this method offers significant gains in robustness to the choice of the summary statistics, does not depend on any type of tolerance level, and, for a given computing time, offers a good trade-off in terms of point-estimate precision and credible-interval quality. We illustrate the performance of our methodological proposal and compare it with earlier ABC methods on a Normal toy example and on a population genetics example dealing with human population evolution.

AVAILABILITY AND IMPLEMENTATION: All methods designed here have been incorporated in the R package abcrf (version 1.7.1), available on CRAN.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
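The idea of regressing a single parameter on all available summary statistics with a random forest, with no tolerance level and no pre-selection of statistics, can be illustrated with a minimal sketch. This is not the abcrf implementation: the toy model (observations drawn from N(theta, 1)), the Uniform(-5, 5) prior, the three summaries (one of them pure noise), and every function name below are assumptions made for illustration. The sketch returns only a point estimate of the parameter, not credible intervals.

```python
import random
import statistics

def simulate(theta, n, rng):
    """Draw n observations from N(theta, 1) and summarize them (assumed toy model)."""
    y = [rng.gauss(theta, 1.0) for _ in range(n)]
    # Two informative summaries and one deliberately uninformative one.
    return [sum(y) / n, statistics.median(y), rng.random()]

def best_split(X, y, features, min_leaf):
    """Find the split minimizing within-node squared error over the given features."""
    n, total = len(y), sum(y)
    best = None  # (score, feature index, threshold)
    for j in features:
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += y[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if k < min_leaf or n - k < min_leaf or lo == hi:
                continue
            # Maximizing this score is equivalent to minimizing the summed squared error.
            score = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
            if best is None or score > best[0]:
                best = (score, j, lo)
    return best

def build_tree(X, y, rng, mtry, min_leaf=5):
    """Grow one regression tree; a leaf is a float, an inner node is a tuple."""
    split = None
    if len(y) >= 2 * min_leaf:
        candidates = rng.sample(range(len(X[0])), mtry)  # random subset of summaries
        split = best_split(X, y, candidates, min_leaf)
    if split is None:
        return sum(y) / len(y)  # leaf value: mean of the simulated parameters
    _, j, t = split
    left = [i for i in range(len(y)) if X[i][j] <= t]
    right = [i for i in range(len(y)) if X[i][j] > t]
    return (j, t,
            build_tree([X[i] for i in left], [y[i] for i in left], rng, mtry, min_leaf),
            build_tree([X[i] for i in right], [y[i] for i in right], rng, mtry, min_leaf))

def predict_tree(node, x):
    """Drop x down the tree until a leaf value is reached."""
    while isinstance(node, tuple):
        j, t, left, right = node
        node = left if x[j] <= t else right
    return node

def fit_forest(X, y, rng, n_trees=50, mtry=2):
    """Fit each tree on a bootstrap resample of the reference table."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(y)) for _ in range(len(y))]
        forest.append(build_tree([X[i] for i in idx], [y[i] for i in idx], rng, mtry))
    return forest

def predict_forest(forest, x):
    """Average the tree predictions to approximate the posterior expectation."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)

rng = random.Random(0)
n_obs, n_sim = 20, 400
thetas = [rng.uniform(-5.0, 5.0) for _ in range(n_sim)]  # draws from the assumed prior
table = [simulate(theta, n_obs, rng) for theta in thetas]  # reference table of summaries
forest = fit_forest(table, thetas, rng)

s_obs = simulate(1.5, n_obs, rng)  # stand-in for the observed data set
estimate = predict_forest(forest, s_obs)
print("RF point estimate of theta:", round(estimate, 3))
```

For a parameter vector with several components, the abstract's recommendation would correspond to calling `fit_forest` once per component, each time with that component as the response and the same reference table of summaries as predictors.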

[1]  Iain Murray,et al.  Fast $\epsilon$-free Inference of Simulation Models with Bayesian Conditional Density Estimation , 2016, 1605.06376.

[2]  David T. Frazier,et al.  Asymptotic properties of approximate Bayesian computation , 2016, Biometrika.

[3]  Arnaud Doucet,et al.  An adaptive sequential Monte Carlo method for approximate Bayesian computation , 2011, Statistics and Computing.

[4]  Jason M. Klusowski Complete Analysis of a Random Forest Model , 2018, ArXiv.

[5]  Mark M. Tanaka,et al.  Sequential Monte Carlo without likelihoods , 2007, Proceedings of the National Academy of Sciences.

[6]  Carlo Gaetan,et al.  Composite likelihood methods for space-time data , 2006 .

[7]  Olivier François,et al.  Non-linear regression models for Approximate Bayesian Computation , 2008, Stat. Comput..

[8]  Kenny Q. Ye,et al.  An integrated map of genetic variation from 1,092 human genomes , 2012, Nature.

[9]  Dennis Prangle,et al.  Adapting the ABC distance function , 2015, 1507.00874.

[10]  Nicolai Meinshausen,et al.  Quantile Regression Forests , 2006, J. Mach. Learn. Res..

[11]  Marcus W. Feldman,et al.  The great human expansion , 2012, Resonance.

[12]  M. Feldman,et al.  Population growth of human Y chromosomes: a study of Y chromosome microsatellites. , 1999, Molecular biology and evolution.

[13]  J. Møller Discussion on the paper by Fearnhead and Prangle , 2012 .

[14]  David Welch,et al.  Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems , 2009, Journal of The Royal Society Interface.

[15]  Jean-Michel Marin,et al.  Approximate Bayesian computational methods , 2011, Statistics and Computing.

[16]  Yanan Fan,et al.  Handbook of Approximate Bayesian Computation , 2018 .

[17]  S. Sisson,et al.  A comparative review of dimension reduction methods in approximate Bayesian computation , 2012, 1202.3819.

[18]  C. Bustamante,et al.  RFMix: a discriminative modeling approach for rapid and robust local-ancestry inference. , 2013, American journal of human genetics.

[19]  O. François,et al.  Approximate Bayesian Computation (ABC) in practice. , 2010, Trends in ecology & evolution.

[20]  Scott M. Williams,et al.  The Great Migration and African-American Genomic Diversity , 2015, bioRxiv.

[21]  Mark A. Beaumont,et al.  Joint determination of topology, divergence time, and immigration in population trees , 2008 .

[22]  Leonhard Held,et al.  Spatio-Temporal Analysis of Epidemic Phenomena Using the R Package surveillance , 2014, ArXiv.

[23]  Arnaud Guyader,et al.  New insights into Approximate Bayesian Computation , 2012, 1207.6461.

[24]  Paul Fearnhead,et al.  Constructing Summary Statistics for Approximate Bayesian Computation: Semi-automatic ABC , 2010, 1004.1112.

[25]  Ryan D. Hernandez,et al.  Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data , 2009, PLoS genetics.

[26]  Matthew A. Nunes,et al.  abctools: An R Package for Tuning Approximate Bayesian Computation Analyses , 2015, R J..

[27]  Jean-Marie Cornuet,et al.  ABC model choice via random forests , 2014, 1406.6288.

[28]  John A. Rogersa,et al.  Correction for "Sequential Monte Carlo without likelihoods" , 2009 .

[29]  Jean-Marie Cornuet,et al.  DIYABC v2.0: a software to make approximate Bayesian computation inferences about population history using single nucleotide polymorphism, DNA sequence and microsatellite data , 2014, Bioinform..

[30]  Richard Durbin,et al.  Inferring human population size and separation history from multiple genome sequences , 2014 .

[31]  M. Beaumont,et al.  CodABC: A Computational Framework to Coestimate Recombination, Substitution, and Molecular Adaptation Rates by Approximate Bayesian Computation , 2015, Molecular biology and evolution.

[32]  Jan Hasenauer,et al.  pyABC: distributed, likelihood-free inference , 2017, bioRxiv.

[33]  Michael J. Hickerson,et al.  Detecting Concerted Demographic Response across Community Assemblages Using Hierarchical Approximate Bayesian Computation , 2014, Molecular biology and evolution.

[34]  Michael Lachmann,et al.  Inferring the history of population size change from genome-wide SNP data. , 2012, Molecular biology and evolution.

[35]  N. Reid,et al.  AN OVERVIEW OF COMPOSITE LIKELIHOOD METHODS , 2011 .

[36]  D. Balding,et al.  Approximate Bayesian computation in population genetics. , 2002, Genetics.

[37]  M. Beaumont Approximate Bayesian Computation in Evolution and Ecology , 2010 .

[38]  Jean-Michel Marin,et al.  Bayesian Essentials with R , 2013 .

[39]  Jean-Michel Marin,et al.  Likelihood-Free Model Choice , 2015, Handbook of Approximate Bayesian Computation.

[40]  L. Excoffier,et al.  Robust Demographic Inference from Genomic and SNP Data , 2013, PLoS genetics.

[41]  Richard R. Hudson,et al.  Generating samples under a Wright-Fisher neutral model of genetic variation , 2002, Bioinform..

[42]  David Reich,et al.  The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States , 2015, American journal of human genetics.

[43]  Paul Marjoram,et al.  Approximately Sufficient Statistics and Bayesian Computation , 2011, Statistical Applications in Genetics and Molecular Biology.

[44]  Paul Fearnhead,et al.  On the Asymptotic Efficiency of ABC Estimators , 2015 .

[45]  P. Donnelly,et al.  Inferring coalescence times from DNA sequence data. , 1997, Genetics.

[46]  C. Bishop Mixture density networks , 1994 .

[47]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[48]  Andreas Ziegler,et al.  ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R , 2015, 1508.04409.

[49]  Jan Hasenauer,et al.  A Scheme for Adaptive Selection of Population Sizes in Approximate Bayesian Computation - Sequential Monte Carlo , 2017, CMSB.

[50]  Jean-Marie Cornuet,et al.  Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation , 2008, Bioinform..

[51]  Saso Dzeroski,et al.  Ensembles of Multi-Objective Decision Trees , 2007, ECML.

[52]  Paul Marjoram,et al.  Choice of Summary Statistic Weights in Approximate Bayesian Computation , 2011, Statistical applications in genetics and molecular biology.

[53]  Katalin Csilléry,et al.  abc: an R package for approximate Bayesian computation (ABC) , 2011, 1106.2793.

[54]  D. Balding,et al.  On Optimal Selection of Summary Statistics for Approximate Bayesian Computation , 2011, Statistical Applications in Genetics and Molecular Biology.

[55]  Olivier Gascuel,et al.  Inferring epidemiological parameters from phylogenies using regression-ABC: A comparative study , 2017, PLoS Comput. Biol..