Reuse, Recycle, Reweigh: Combating Influenza through Efficient Sequential Bayesian Computation for Massive Data.

Massive datasets in the gigabyte and terabyte range, combined with the availability of increasingly sophisticated statistical tools, yield analyses at the boundary of what is computationally feasible. Compromising in the face of this computational burden by partitioning the dataset into more tractable sizes results in stratified analyses, removed from the context that justified the initial data collection. In a Bayesian framework, these stratified analyses generate intermediate realizations, often compared using point estimates that fail to account for the variability within and correlation between the distributions these realizations approximate. However, although the initial concession to stratify generally precludes the more sensible analysis using a single joint hierarchical model, we can circumvent this outcome and capitalize on the intermediate realizations by extending the dynamic iterative reweighting MCMC algorithm. In doing so, we reuse the available realizations by reweighting them with importance weights, recycling them into a now-tractable joint hierarchical model. We apply this technique to intermediate realizations generated from stratified analyses of 687 influenza A genomes spanning 13 years, allowing us to revisit hypotheses regarding the evolutionary history of influenza within a hierarchical statistical framework.
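
The reweighting idea can be illustrated with a minimal sketch. This is plain self-normalized importance sampling on a toy one-parameter problem, not the dynamic iterative reweighting MCMC algorithm itself. All names here (`draws`, `importance_weights`, the flat stratum-level prior, the Normal(1, 1) hierarchical prior) are assumptions made for illustration only. Stored draws from a stratified analysis are reweighted by the ratio of the new prior density to the old one, so that weighted averages approximate expectations under the new model without rerunning the sampler.

```python
import numpy as np

def log_normal_pdf(x, mean, sd):
    """Log density of a Normal(mean, sd^2) distribution evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2

def importance_weights(log_target, log_proposal):
    """Self-normalized importance weights computed stably on the log scale."""
    log_w = log_target - log_proposal
    log_w -= np.max(log_w)          # guard against overflow in exp
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(w):
    """Kish effective sample size of normalized weights."""
    return 1.0 / np.sum(w**2)

rng = np.random.default_rng(0)

# Stand-in for stored realizations from one stratified analysis:
# a posterior that is Normal(0, 1) under a flat prior (assumed toy setup).
draws = rng.normal(0.0, 1.0, size=200_000)

# Old (stratum-level) prior: flat, so its log density is a constant.
log_old_prior = np.zeros_like(draws)
# New prior, standing in for the hierarchical level: Normal(1, 1) (assumed).
log_new_prior = log_normal_pdf(draws, 1.0, 1.0)

# The likelihood cancels in the ratio, leaving new prior over old prior.
weights = importance_weights(log_new_prior, log_old_prior)

reweighted_mean = np.sum(weights * draws)
ess = effective_sample_size(weights)
print(f"reweighted mean: {reweighted_mean:.3f}, ESS: {ess:.0f}")
```

In this toy setup the exact posterior under the new prior is Normal(0.5, 0.5), so the reweighted mean should land close to 0.5. The effective sample size indicates how much information survives the reweighting; a small value would suggest that the stored draws cover the new target poorly.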
