Sequence-based prioritization of nonsynonymous single-nucleotide polymorphisms for the study of disease mutations.

The increasing demand for the identification of genetic variation responsible for common diseases has translated into a need for sophisticated methods that effectively prioritize mutations occurring in disease-associated genetic regions. In this article, we prioritize candidate nonsynonymous single-nucleotide polymorphisms (nsSNPs) through a bioinformatics approach that takes advantage of a set of improved numeric features derived from protein-sequence information and a new statistical learning model called "multiple selection rule voting" (MSRV). Because the features are derived from sequence alone, the approach has the widest possible scope of application, and the MSRV model can capture subtle characteristics of individual mutations. Systematic validation demonstrates that the approach is capable of prioritizing causal mutations for both simple monogenic diseases and complex polygenic diseases. Further studies of familial Alzheimer disease and diabetes show that the approach ranks the mutations underlying these polygenic diseases near the top of the candidate list. Application of the approach to unclassified mutations identifies 10 suspicious mutations that are likely to cause disease, a prediction strongly supported by the literature.
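The general idea of ranking candidate substitutions by votes over sequence-derived features can be illustrated with a minimal sketch. This is not the MSRV model or the feature set of the article, neither of which is specified in this abstract. The three features, the thresholds, and the names `features`, `RULES`, `vote_score`, and `prioritize` are all hypothetical choices made for illustration; only the Kyte-Doolittle hydropathy values are taken from the published scale.

```python
# Kyte-Doolittle hydropathy index (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charge at neutral pH (assumption: all others are 0).
CHARGE = {"D": -1, "E": -1, "K": 1, "R": 1}


def features(wild, mutant):
    """Hypothetical numeric features computed from the two residues alone."""
    return {
        "hydropathy_change": abs(KD[mutant] - KD[wild]),
        "charge_change": abs(CHARGE.get(mutant, 0) - CHARGE.get(wild, 0)),
        "touches_pro_or_gly": float(wild in "PG" or mutant in "PG"),
    }


# Illustrative rules with arbitrary thresholds; each casts one vote.
RULES = [
    lambda f: f["hydropathy_change"] > 2.0,
    lambda f: f["charge_change"] > 0,
    lambda f: f["touches_pro_or_gly"] > 0,
]


def vote_score(wild, mutant):
    """Fraction of rules that flag the substitution as suspicious."""
    f = features(wild, mutant)
    return sum(rule(f) for rule in RULES) / len(RULES)


def prioritize(candidates):
    """Sort (name, wild, mutant) tuples from most to least suspicious."""
    return sorted(candidates, key=lambda c: vote_score(c[1], c[2]), reverse=True)


if __name__ == "__main__":
    # Made-up candidate substitutions, not data from the article.
    candidates = [("m1", "A", "V"), ("m2", "D", "G"), ("m3", "L", "I")]
    for name, wild, mutant in prioritize(candidates):
        print(name, f"{wild}->{mutant}", round(vote_score(wild, mutant), 2))
```

A learned model would select its rules and thresholds from training data rather than fixing them by hand as done here; the sketch only shows how per-rule votes can be aggregated into a ranking score.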

[1]  M. Orozco,et al.  Characterization of disease-associated single amino acid polymorphisms in terms of sequence and structure properties. , 2002, Journal of molecular biology.

[2]  J. Hirschhorn,et al.  A comprehensive review of genetic association studies , 2002, Genetics in Medicine.

[3]  M. Levitt Conformational preferences of amino acids in globular proteins. , 1978, Biochemistry.

[4]  Hua Yang,et al.  Searching for interpretable rules for disease mutations: a simulated annealing bump hunting strategy , 2006, BMC Bioinformatics.

[5]  C Cruz,et al.  Genetic studies of the lac repressor. XIV. Analysis of 4000 altered Escherichia coli lac repressors reveals essential and non-essential residues, as well as "spacers" which do not require a specific sequence. , 1994, Journal of molecular biology.

[6]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[7]  Jeffrey Miller,et al.  Genetic Studies of Lac Repressor: 4000 Single Amino Acid Substitutions and Analysis of the Resulting Phenotypes on the Basis of the Protein Structure , 1996, German Conference on Bioinformatics.

[8]  Andreas Prlic,et al.  Ensembl 2006 , 2005, Nucleic Acids Res..

[9]  M. Olivier A haplotype map of the human genome. , 2003, Nature.

[10]  A. Garg,et al.  A Novel Heterozygous Mutation in Peroxisome Proliferator-Activated Receptor-γ Gene in a Patient with Familial Partial Lipodystrophy , 2002 .

[11]  J. Seidman,et al.  Autosomal dominant hypocalcaemia caused by a Ca2+-sensing receptor gene mutation , 1994, Nature Genetics.

[12]  S. Henikoff,et al.  Predicting deleterious amino acid substitutions. , 2001, Genome research.

[13]  D. Chasman,et al.  Predicting the functional consequences of non-synonymous single nucleotide polymorphisms: structure-based assessment of amino acid variation. , 2001, Journal of molecular biology.

[14]  H. Muller The American Journal of Human Genetics Vol . 2 No . 2 June 1950 Our Load of Mutations 1 , 2006 .

[15]  K. Olsen,et al.  Hemoglobin connecticut (β21(b3) Asp→Gly): A hemoglobin variant with low oxygen affinity , 1981 .

[16]  J. Ross Quinlan,et al.  C4.5: Programs for Machine Learning , 1992 .

[17]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[18]  James R. Knight,et al.  Genome sequencing in microfabricated high-density picolitre reactors , 2005, Nature.

[19]  Bassem A. Hassan,et al.  Gene prioritization through genomic data fusion , 2006, Nature Biotechnology.

[20]  Y. Blouquit,et al.  Hemoglobin La Desiradb αA2β2 129 (H7) Ala → Val: A New Unstable Hemoglobin , 1986 .

[21]  M. Olivier A haplotype map of the human genome , 2003, Nature.

[22]  P. Bork,et al.  Human non-synonymous SNPs: server and survey. , 2002, Nucleic acids research.

[23]  S. Bouvier,et al.  Systematic mutation of bacteriophage T4 lysozyme. , 1991, Journal of molecular biology.

[24]  S. O’Rahilly,et al.  human PPARg associated with severe insulin resistance, diabetes mellitus and hypertension , 1999 .

[25]  Robert D. Finn,et al.  Pfam: clans, web tools and services , 2005, Nucleic Acids Res..

[26]  Christopher T. Saunders,et al.  Evaluation of structural and evolutionary contributions to deleterious mutation prediction. , 2002, Journal of molecular biology.

[27]  S. O’Rahilly,et al.  Non-DNA binding, dominant-negative, human PPARγ mutations cause lipodystrophic insulin resistance , 2006, Cell metabolism.

[28]  F. Hecht,et al.  Hemoglobin-Seattle (α2Aβ276 Glu): An Unstable Hemoglobin Causing Chronic Hemolytic Anemia , 1970 .

[29]  R. Doolittle,et al.  A simple method for displaying the hydropathic character of a protein. , 1982, Journal of molecular biology.

[30]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[31]  Vladimir Vapnik,et al.  Statistical learning theory , 1998 .

[32]  R. Hegele,et al.  PPARG F388L, a transactivation-deficient mutant, in familial partial lipodystrophy. , 2002, Diabetes.

[33]  Eric S. Lander,et al.  The common PPARγ Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes , 2000, Nature Genetics.

[34]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[35]  Cathy H. Wu,et al.  The Universal Protein Resource (UniProt) , 2004, Nucleic Acids Res..

[36]  Albert Y Lau,et al.  Functional classification of proteins and protein variants. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[37]  E. Lander,et al.  Genetic dissection of complex traits science , 1994 .

[38]  G. Stamatoyannopoulos,et al.  Physiologic implications of a hemoglobin with decreased oxygen affinity (hemoglobin seattle). , 1969, The New England journal of medicine.

[39]  P. Thomas,et al.  Coding single-nucleotide polymorphisms associated with complex vs. Mendelian disease: evolutionary evidence for differences in molecular effects. , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[40]  Yan Cui,et al.  Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information , 2005, Bioinform..

[41]  B. Moats-Staats,et al.  A novel mutation (E767K) in the second extracellular loop of the calcium sensing receptor in a family with autosomal dominant hypocalcemia , 2005, American journal of medical genetics. Part A.

[42]  Warren C. Lathe,et al.  Prediction of deleterious human alleles. , 2001, Human molecular genetics.

[43]  Peng Yue,et al.  SNPs3D: Candidate gene and SNP selection for association studies , 2006, BMC Bioinformatics.

[44]  M. Orozco,et al.  Sequence‐based prediction of pathological mutations , 2004, Proteins.

[45]  E S Lander,et al.  The common PPARgamma Pro12Ala polymorphism is associated with decreased risk of type 2 diabetes. , 2000, Nature genetics.

[46]  Claudio J. Verzilli,et al.  A hierarchical Bayesian model for predicting the functional consequences of amino‐acid polymorphisms , 2005 .

[47]  金田 重郎,et al.  C4.5: Programs for Machine Learning (書評) , 1995 .

[48]  Anushya Muruganujan,et al.  PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification , 2003, Nucleic Acids Res..

[49]  K. Tokunaga,et al.  Identification of the gene variations in human CD22 , 1999, Immunogenetics.

[50]  S. Miwa,et al.  Hemoglobin saitama or beta 117 (G19) His leads to Pro, a new variant causing hemolytic disease. , 1983, Hemoglobin.

[51]  M. Gerstein,et al.  Sequence variation in G-protein-coupled receptors: analysis of single nucleotide polymorphisms , 2005, Nucleic acids research.

[52]  G. Burchard,et al.  Variability of the CD36 gene in West Africa , 2001, Human mutation.

[53]  David R. Westhead,et al.  A comparative study of machine-learning methods to predict the effects of single nucleotide polymorphisms on protein function , 2003, Bioinform..

[54]  J. Nadeau,et al.  Finding Genes That Underlie Complex Traits , 2002, Science.

[55]  Chih-Jen Lin,et al.  Working Set Selection Using Second Order Information for Training Support Vector Machines , 2005, J. Mach. Learn. Res..

[56]  Richard A Mathies,et al.  Microfabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. , 2006, Proceedings of the National Academy of Sciences of the United States of America.