FunSAV: Predicting the Functional Effect of Single Amino Acid Variants Using a Two-Stage Random Forest Model

Single amino acid variants (SAVs) are the most abundant form of known genetic variation associated with human disease. Successful prediction of the functional impact of SAVs from sequences can thus lead to an improved understanding of why a SAV may be associated with a particular disease. In this work, we constructed a high-quality structural dataset of 679 protein structures with 2,048 SAVs by collecting human genetic variant data from multiple resources and dividing the variants into two categories: disease-associated and neutral. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with additional features that were not explored in previous studies. Importantly, a two-step feature selection procedure was proposed to select the most informative features that contribute to predicting the disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved an area under the curve (AUC) of 0.882, which is competitive with, and in some cases better than, existing tools including SIFT, SNAP, PolyPhen2, PANTHER, nsSNPAnalyzer and PhD-SNP. The source code of FunSAV and the datasets can be downloaded at http://sunflower.kuicr.kyoto-u.ac.jp/sjn/FunSAV.
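To make the reported metric concrete, the following is a minimal sketch of how an AUC value can be computed from predicted scores. It is not the FunSAV implementation: the function name `roc_auc` and the toy labels and scores are illustrative assumptions, with 1 standing for a disease-associated variant and 0 for a neutral one. The sketch uses the rank interpretation of AUC, i.e. the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Return the AUC as the fraction of (positive, negative) pairs in which
    the positive example has the higher score; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the FunSAV benchmark:
# 1 = disease-associated, 0 = neutral; scores are predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

auc = roc_auc(labels, scores)
print(round(auc, 3))  # 8 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a small illustration; for a dataset of a few thousand variants a sort-based rank computation or an established library routine would be the more practical choice.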
