A population genetic hidden Markov model for detecting genomic regions under selection.

Recently, hidden Markov models have been applied to numerous problems in genomics. Here, we introduce an explicit population genetic hidden Markov model (popGenHMM) that uses single nucleotide polymorphism (SNP) frequency data to identify genomic regions that have experienced recent selection. Our popGenHMM assumes that SNP frequencies are emitted independently according to diffusion approximation expectations, but that the frequencies of neighboring SNPs are partially correlated through their shared selective state. We report results from training our popGenHMM on, and applying it to, a set of early release data from the Drosophila Population Genomics Project (dpgp.org) consisting of approximately 7.8 Mb of resequencing from 32 North American Drosophila melanogaster lines. These results demonstrate the potential utility of our model, which uses the site frequency spectrum (SFS) to predict regions of the genome that represent selected elements.
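The model structure described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three selective states, their scaled selection coefficients (`gamma`, taken here as 2Ns), the transition probabilities, and the function names `sample_sfs` and `viterbi` are all assumptions chosen for illustration, and the parameters are fixed by hand rather than trained. Emission probabilities use the standard diffusion-approximation frequency density for a site under genic selection, which is proportional to (1 - e^{-2γ(1-x)}) / ((1 - e^{-2γ}) x (1-x)) and reduces to 1/x when γ = 0. This density is binomially sampled down to n chromosomes and normalized over segregating sites.

```python
import math
import numpy as np

def sample_sfs(n, gamma, grid=2000):
    """P(i derived copies out of n | site is segregating), for i = 1..n-1.

    gamma is the scaled selection coefficient (assumed 2Ns); gamma = 0 is neutral.
    The population frequency density is integrated with a midpoint rule.
    """
    x = (np.arange(grid) + 0.5) / grid  # midpoints on (0, 1), avoids the endpoints
    if abs(gamma) < 1e-8:
        density = 1.0 / x
    else:
        density = np.expm1(-2 * gamma * (1 - x)) / (np.expm1(-2 * gamma) * x * (1 - x))
    probs = np.array([
        math.comb(n, i) * np.sum(x**i * (1 - x)**(n - i) * density) / grid
        for i in range(1, n)
    ])
    return probs / probs.sum()

def viterbi(counts, log_init, log_trans, log_emit):
    """Most probable state path; counts are derived-allele counts in 1..n-1."""
    obs = np.asarray(counts) - 1            # column index into log_emit
    T, K = len(obs), len(log_init)
    score = np.empty((T, K))
    back = np.zeros((T, K), dtype=int)
    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        cand = score[t - 1][:, None] + log_trans   # cand[j, k]: move from j to k
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Hypothetical setup: 10 chromosomes; states 0 = neutral, 1 = negative, 2 = positive.
n = 10
gammas = [0.0, -5.0, 5.0]
log_emit = np.log(np.array([sample_sfs(n, g) for g in gammas]))
trans = np.full((3, 3), 0.025) + np.eye(3) * 0.925   # rows sum to 1, diagonal 0.95
log_trans = np.log(trans)
log_init = np.log(np.full(3, 1 / 3))

# Toy data: a run of singletons followed by a run of high-frequency derived alleles.
counts = [1] * 20 + [n - 1] * 20
path = viterbi(counts, log_init, log_trans, log_emit)
print(path)
```

Each SNP's count is emitted independently given its state, while the sticky transition matrix is what couples neighboring SNPs. In a trained model the transition and selection parameters would be estimated from data, for example by expectation maximization, rather than set by hand as they are here.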
