Context dependence, ancestral misidentification, and spurious signatures of natural selection.

Population genetic analyses often use polymorphism data from one species together with orthologous genomic sequences from closely related outgroup species. These outgroup sequences are frequently used to identify ancestral alleles at segregating sites and to compare the patterns of polymorphism and divergence. Inherent in such studies is the assumption of parsimony, which posits that the ancestral state of each single nucleotide polymorphism (SNP) is the allele that matches the orthologous site in the outgroup sequence, and that all nucleotide substitutions between species have been observed. This study tests the effect of violating the parsimony assumption when mutation rates vary across sites and over time. Using a context-dependent mutation model that accounts for elevated mutation rates at CpG dinucleotides, an increased propensity for transitions over transversions, and other directional and contextual mutation biases estimated along the human lineage, we show (using both simulations and a theoretical model) that enough unobserved substitutions could have occurred since the divergence of human and chimpanzee to cause many statistical tests to spuriously reject neutrality. Moreover, using both the chimpanzee and rhesus macaque genomes to parsimoniously identify ancestral states causes a large fraction of the data to be removed while not completely alleviating the problem. By constructing a novel model of the context-dependent mutation process, we can correct polymorphism data for the effect of ancestral misidentification using a single outgroup.
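To illustrate why ancestral misidentification can mimic selection, the following minimal sketch mixes each class of the expected neutral site-frequency spectrum with its mirror class. It assumes a single uniform misidentification probability `p` for every SNP, which is a simplification introduced here for illustration and not the context-dependent model described above; the function names and the values of `n` and `p` are likewise hypothetical.

```python
def neutral_sfs(n):
    """Expected unfolded SFS under the standard neutral model.

    For a sample of n chromosomes, the class with i derived copies
    (i = 1 .. n-1) has weight proportional to 1/i.
    """
    weights = [1.0 / i for i in range(1, n)]
    total = sum(weights)
    return [w / total for w in weights]


def misoriented_sfs(sfs, p):
    """Apply a uniform misidentification rate p (an assumption of this sketch).

    A SNP with i derived copies whose ancestral state is misidentified
    is recorded as having n - i derived copies, so each class is mixed
    with its mirror class.
    """
    k = len(sfs)  # k = n - 1 classes; index j holds derived count j + 1
    return [(1 - p) * sfs[j] + p * sfs[k - 1 - j] for j in range(k)]


n = 10          # hypothetical sample size
p = 0.05        # hypothetical misidentification probability
true_sfs = neutral_sfs(n)
observed = misoriented_sfs(true_sfs, p)

for i, (t, o) in enumerate(zip(true_sfs, observed), start=1):
    print(f"derived count {i}: neutral {t:.4f}, misoriented {o:.4f}")
```

Because rare derived alleles are the most common class under neutrality, even a small `p` moves a noticeable share of them into the high-frequency derived classes, producing the kind of excess that tests based on the unfolded spectrum can read as evidence of selection.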
