Prediction of the phenotypic effects of non-synonymous single nucleotide polymorphisms using structural and evolutionary information

MOTIVATION: There has been great expectation that knowledge of an individual's genotype will provide a basis for assessing susceptibility to diseases and designing individualized therapy. Non-synonymous single nucleotide polymorphisms (nsSNPs), which lead to an amino acid change in the protein product, are of particular interest because they account for nearly half of the known genetic variations related to human inherited diseases. To help identify disease-associated nsSNPs among a large number of neutral nsSNPs, it is important to develop computational tools that predict the phenotypic effects of nsSNPs.

RESULTS: We prepared a training set based on the variant phenotypic annotation of the Swiss-Prot database and focused our analysis on nsSNPs having homologous 3D structures. Structural environment parameters derived from the homologous 3D structure, as well as evolutionary information derived from the multiple sequence alignment, were used as predictors. Two machine learning methods, support vector machine and random forest, were trained and evaluated. We compared the performance of our method with that of the SIFT algorithm, which is one of the best predictive methods to date. An unbiased evaluation study shows that for nsSNPs with sufficient evolutionary information (at least 10 homologous sequences), the performance of our method is comparable with that of the SIFT algorithm, while for nsSNPs with insufficient evolutionary information (<10 homologous sequences), our method significantly outperforms the SIFT algorithm. These findings indicate that incorporating structural information is critical to achieving good prediction accuracy when sufficient evolutionary information is not available.

AVAILABILITY: The code and curated dataset are available at http://compbio.utmem.edu/snp/dataset/
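The stratified comparison described in RESULTS can be illustrated with a minimal sketch. It splits predictions at the cutoff of 10 homologous sequences given in the abstract and scores each subset separately. The record fields (`n_homologs`, `label`, `pred`), the toy data, and the choice of accuracy and the Matthews correlation coefficient as metrics are assumptions made for illustration; the abstract does not state which measures or data layout the authors used.

```python
from math import sqrt

HOMOLOG_THRESHOLD = 10  # cutoff stated in the abstract


def confusion(records):
    """Return (tp, tn, fp, fn); label/pred 1 = deleterious, 0 = neutral."""
    tp = tn = fp = fn = 0
    for r in records:
        if r["label"] == 1 and r["pred"] == 1:
            tp += 1
        elif r["label"] == 0 and r["pred"] == 0:
            tn += 1
        elif r["label"] == 0 and r["pred"] == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(records):
    """Fraction of records whose prediction matches the label."""
    tp, tn, fp, fn = confusion(records)
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else float("nan")


def mcc(records):
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion(records)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def stratify(records, threshold=HOMOLOG_THRESHOLD):
    """Split into (sufficient, insufficient) evolutionary information."""
    sufficient = [r for r in records if r["n_homologs"] >= threshold]
    insufficient = [r for r in records if r["n_homologs"] < threshold]
    return sufficient, insufficient


# Hypothetical toy predictions, not data from the paper.
toy = [
    {"n_homologs": 25, "label": 1, "pred": 1},
    {"n_homologs": 40, "label": 0, "pred": 0},
    {"n_homologs": 12, "label": 1, "pred": 0},
    {"n_homologs": 10, "label": 0, "pred": 0},
    {"n_homologs": 3, "label": 1, "pred": 1},
    {"n_homologs": 7, "label": 0, "pred": 1},
    {"n_homologs": 0, "label": 0, "pred": 0},
    {"n_homologs": 9, "label": 1, "pred": 1},
]

sufficient, insufficient = stratify(toy)
for name, subset in (("sufficient", sufficient), ("insufficient", insufficient)):
    print(name, len(subset), round(accuracy(subset), 3), round(mcc(subset), 3))
```

Running the same scoring once per predictor on each subset is one way to reproduce the kind of side-by-side comparison the abstract reports for the two groups of nsSNPs.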
