Hidden Markov Dirichlet process: modeling genetic inference in open ancestral space

Inferring population structure, linkage disequilibrium patterns, and chromosomal recombination hotspots from genetic polymorphism data is essential for understanding the origin and characteristics of genome variation, with important applications to the genetic analysis of disease propensities and other complex traits. Statistical genetic methods developed so far mostly address these problems separately, using specialized models that range from coalescence and admixture models for population structure to hidden Markov models and renewal processes for recombination. Most of these approaches ignore both the inherent uncertainty in the genetic complexity of the data (e.g., the number of genetic founders of a population) and the close statistical and biological relationships among the objects studied in these problems. We present a new statistical framework, the hidden Markov Dirichlet process (HMDP), to jointly model the genetic recombinations among a possibly infinite number of founders and the coalescence-with-mutation events in the resulting genealogies. The HMDP posits that a haplotype of genetic markers is generated by a sequence of recombination events that select an ancestor for each locus from an unbounded set of founders according to a first-order Markov transition process. Conjoining this process with a mutation model, our method accommodates both between-lineage recombination and within-lineage sequence variation, and leads to a compact and natural interpretation of the population structure and inheritance process underlying haplotype data. We have developed an efficient sampling algorithm for the HMDP based on a two-level nested Pólya urn scheme, and we present experimental results on joint inference of population structure, linkage disequilibrium, and recombination hotspots based on the HMDP. On both simulated and real SNP haplotype data, our method performs competitively with or significantly better than existing methods in uncovering recombination hotspots along chromosomal loci; in addition, it infers the ancestral genetic patterns and offers a highly accurate map of the ancestral compositions of modern populations.
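
The generative story described above (an ancestor chosen per locus from an open-ended founder set by a first-order Markov process, followed by mutation) can be illustrated with a toy sketch. This is not the HMDP itself: it uses a single-level Pólya urn rather than the two-level nested scheme, binary alleles, and a constant recombination probability. The function names and the parameters `alpha`, `recomb_prob`, and `mut_prob` are hypothetical and chosen only for illustration.

```python
import random


def polya_urn_draw(counts, alpha, rng):
    """Pick an existing founder k with probability counts[k] / (n + alpha),
    or a new founder (index len(counts)) with probability alpha / (n + alpha)."""
    total = sum(counts) + alpha
    u = rng.random() * total
    acc = 0.0
    for k, c in enumerate(counts):
        acc += c
        if u < acc:
            return k
    return len(counts)


def sample_haplotype(num_loci, founders, counts, alpha, recomb_prob, mut_prob, rng):
    """Toy sampler (assumption: binary alleles, single-level urn).

    At the first locus, and at later loci with probability recomb_prob,
    the ancestor is redrawn from the urn; otherwise the previous ancestor
    is kept (first-order Markov dependence). The emitted allele is the
    founder's allele, flipped with probability mut_prob.
    """
    haplotype, path = [], []
    current = None
    for t in range(num_loci):
        if current is None or rng.random() < recomb_prob:
            current = polya_urn_draw(counts, alpha, rng)
            if current == len(founders):
                # A previously unseen founder: give it random alleles.
                founders.append([rng.randint(0, 1) for _ in range(num_loci)])
                counts.append(0)
            counts[current] += 1
        allele = founders[current][t]
        if rng.random() < mut_prob:
            allele = 1 - allele  # point mutation within the lineage
        haplotype.append(allele)
        path.append(current)
    return haplotype, path


# Usage: draw a few haplotypes that share one growing founder pool.
rng = random.Random(0)
founders, counts = [], []
samples = [sample_haplotype(10, founders, counts, 1.0, 0.1, 0.01, rng)
           for _ in range(5)]
```

Because the founder pool grows only when the urn proposes a new index, the number of founders is not fixed in advance, which is the "open ancestral space" idea in miniature; inference in the actual model would run in the opposite direction, from observed haplotypes to founders and recombination points.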
