Positional conservation and amino acids shape the correct diagnosis and population frequencies of benign and damaging personal amino acid mutations.

As the cost of DNA sequencing drops, we are moving beyond one genome per species to one genome per individual to improve prevention, diagnosis, and treatment of disease by using personal genotypes. Computational methods are frequently applied to predict impairment of gene function by nonsynonymous mutations in individual genomes and single nucleotide polymorphisms (nSNPs) in populations. These computational tools are, however, known to fail 15%-40% of the time. We find that accurate discrimination between benign and deleterious mutations is strongly influenced by the long-term (among species) history of positions that harbor those mutations. Successful prediction of known disease-associated mutations (DAMs) is much higher for evolutionarily conserved positions and for original-mutant amino acid pairs that are rarely seen among species. Prediction accuracies for nSNPs show opposite patterns, forecasting impediments to building diagnostic tools aiming to simultaneously reduce both false-positive and false-negative errors. The relative allele frequencies of mutations diagnosed as benign and damaging are predicted by positional evolutionary rates. These allele frequencies are modulated by the relative preponderance of the mutant allele in the set of amino acids found at homologous sites in other species (evolutionarily permissible alleles [EPAs]). The nSNPs found in EPAs are biochemically less severe than those missing from EPAs across all allele frequency categories. Therefore, it is important to consider position evolutionary rates and EPAs when interpreting the consequences and population frequencies of human mutations. The impending sequencing of thousands of human and many more vertebrate genomes will lead to more accurate classifiers needed in real-world applications.

[1]  B. Rost,et al.  SNAP: predict effect of non-synonymous polymorphisms on function , 2007, Nucleic acids research.

[2]  J. Thompson,et al.  CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. , 1994, Nucleic acids research.

[3]  S. Henikoff,et al.  Predicting the effects of amino acid substitutions on protein function. , 2006, Annual review of genomics and human genetics.

[4]  Ryan D. Hernandez,et al.  Proportionally more deleterious genetic variation in European than in African populations , 2008, Nature.

[5]  J. Lupski,et al.  The complete genome of an individual by massively parallel DNA sequencing , 2008, Nature.

[6]  Joel Dudley,et al.  TimeTree: a public knowledge-base of divergence times among organisms , 2006, Bioinform..

[7]  Justin C. Fay,et al.  A Catalog of Neutral and Deleterious Polymorphism in Yeast , 2008, PLoS genetics.

[8]  M. Kimura,et al.  The neutral theory of molecular evolution. , 1983, Scientific American.

[9]  Sudhir Kumar,et al.  Higher intensity of purifying selection on >90% of the human genes revealed by the intrinsic replacement mutation rates. , 2006, Molecular biology and evolution.

[10]  A. Eyre-Walker,et al.  The Distribution of Fitness Effects of New Deleterious Amino Acid Mutations in Humans , 2006, Genetics.

[11]  Steven Henikoff,et al.  SIFT: predicting amino acid changes that affect protein function , 2003, Nucleic Acids Res..

[12]  Shamil R Sunyaev,et al.  Most rare missense alleles are deleterious in humans: implications for complex disease and association studies. , 2007, American journal of human genetics.

[13]  W. Fitch Toward Defining the Course of Evolution: Minimum Change for a Specific Tree Topology , 1971 .

[14]  Colin N. Dewey,et al.  Initial sequencing and comparative analysis of the mouse genome. , 2002 .

[15]  Pietro Liò,et al.  Prediction by Graph Theoretic Measures of Structural Effects in Proteins Arising from Non-Synonymous Single Nucleotide Polymorphisms , 2008, PLoS Comput. Biol..

[16]  Sudhir Kumar,et al.  Evolutionary anatomies of positions and types of disease-associated and neutral amino acid mutations in the human genome , 2006, BMC Genomics.

[17]  Jun Guo,et al.  Predicting the phenotypic effects of non-synonymous single nucleotide polymorphisms based on support vector machines , 2007, BMC Bioinformatics.

[18]  Sudhir Kumar,et al.  Gene Expression Intensity Shapes Evolutionary Rates of the Proteins Encoded by the Vertebrate Genome , 2004, Genetics.

[19]  Timothy B. Stockwell,et al.  The Diploid Genome Sequence of an Individual Human , 2007, PLoS biology.

[20]  Jianzhi Zhang,et al.  Why are some human disease-associated mutations fixed in mice? , 2003, Trends in genetics : TIG.

[21]  Jonathan Pevsner,et al.  Basic Local Alignment Search Tool (BLAST) , 2005 .

[22]  Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome , 2002, Nature.

[23]  B. Shastry SNPs in disease gene mapping, medicinal drug development and evolution , 2007, Journal of Human Genetics.

[24]  S. Sunyaev,et al.  PSIC: profile extraction from sequence alignments with position-specific counts of independent observations. , 1999, Protein engineering.

[25]  Alexey S Kondrashov,et al.  Direct estimates of human per nucleotide mutation rates at 20 loci causing mendelian diseases , 2003, Human mutation.

[26]  S. Sunyaev,et al.  Dobzhansky–Muller incompatibilities in protein evolution , 2002, Proceedings of the National Academy of Sciences of the United States of America.

[27]  Joni L Rutter,et al.  Candidate single nucleotide polymorphism selection using publicly available tools: a guide for epidemiologists. , 2006, American journal of epidemiology.

[28]  Emily L. Webb,et al.  The Predicted Impact of Coding Single Nucleotide Polymorphisms Database , 2005, Cancer Epidemiology Biomarkers & Prevention.

[29]  Tatiana Tatusova,et al.  NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins , 2004, Nucleic Acids Res..

[30]  Andrew J. Grimm,et al.  Interpreting missense variants: comparing computational methods in human disease genes CDKN2A, MLH1, MSH2, MECP2, and tyrosinase (TYR) , 2007, Human mutation.

[31]  P. Bork,et al.  Human non-synonymous SNPs: server and survey. , 2002, Nucleic acids research.

[32]  Dawei Li,et al.  The diploid genome sequence of an Asian individual , 2008, Nature.

[33]  Mary Goldman,et al.  The UCSC Genome Browser database: update 2011 , 2010, Nucleic Acids Res..

[34]  M. Miller,et al.  Understanding human disease mutations through the use of interspecific genetic variation. , 2001, Human molecular genetics.

[35]  Ting Wang,et al.  The UCSC Genome Browser Database: update 2009 , 2008, Nucleic Acids Res..

[36]  Ryan D. Hernandez,et al.  Assessing the Evolutionary Impact of Amino Acid Mutations in the Human Genome , 2008, PLoS genetics.

[37]  Nancy F. Hansen,et al.  Accurate Whole Human Genome Sequencing using Reversible Terminator Chemistry , 2008, Nature.

[38]  A. Kondrashov,et al.  Distribution of the strength of selection against amino acid replacements in human proteins. , 2005, Human molecular genetics.