Bayesian nonparametric models of genetic variation

We will develop three new Bayesian nonparametric models of genetic variation. Each is a dynamic-clustering approximation of the ancestral recombination graph (ARG), a structure that fully describes the genetic history of a population. Because the ARG is so complex, efficient inference for it is not possible; however, different aspects of the ARG can be captured by the approximations discussed in our work. The ARG can be described by a tree-valued hidden Markov model (HMM) in which the trees vary along the genetic sequence. Many modern models of genetic variation proceed by approximating these trees with (often finite) clusterings. We will instead place Bayesian nonparametric priors on the clusterings, thereby providing nonparametric generalizations of these models and avoiding problems with model selection and label switching. Further, we will compare the performance of these models on a wide selection of inference problems in genetics, such as phasing, imputation, genome-wide association, and admixture or bottleneck discovery. These experiments should provide a common testing ground on which the different approximations inherent in modern genetic models can be compared. The results should shed light on the nature of the approximations and guide future application of these models.
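To make the idea of a Bayesian nonparametric prior over clusterings concrete, the sketch below draws one clustering of a set of sequences from a Chinese restaurant process, a standard prior in which the number of clusters is not fixed in advance. It illustrates the general technique only and is not one of the three models proposed here; the function name `sample_crp_partition`, the concentration parameter `alpha`, and the choice of 20 sequences are assumptions made for the example.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw cluster labels for n_items from a Chinese restaurant process.

    alpha > 0 is the concentration parameter (an assumed, illustrative name):
    larger values make new clusters more likely.
    """
    labels = []
    counts = []  # counts[k] = number of items currently assigned to cluster k
    for _ in range(n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)  # open a new cluster
        else:
            counts[k] += 1    # join an existing cluster
        labels.append(k)
    return labels


rng = random.Random(0)
labels = sample_crp_partition(20, alpha=1.0, rng=rng)
print(labels)
print("clusters used:", len(set(labels)))
```

Because the number of clusters is itself random under this prior, no separate model-selection step over the cluster count is needed, which is the property the nonparametric generalizations above rely on.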
