Feature-based multiple models improve classification of mutation-induced stability changes

Background: Reliable prediction of stability changes in protein variants is an important aspect of computational protein design. A number of machine learning methods have emerged that classify stability changes knowing only the sequence of the protein. However, their performance on amino acid substitutions in previously unseen non-homologous proteins is rather limited. Moreover, the performance varies for different types of mutations depending on the secondary structure or accessible surface area of the mutation site.

Results: We proposed feature-based multiple models, with each model designed for a specific type of mutation. The new method is composed of five models trained for mutations in exposed, buried, helical, sheet, and coil residues. The classification of a mutation as stabilising or destabilising is made as a consensus of two models, one selected based on the predicted accessible surface area and the other based on the predicted secondary structure of the mutation site. We refer to our new method as Evolutionary, Amino acid, and Structural Encodings with Multiple Models (EASE-MM). Cross-validation results show that EASE-MM provides a notable improvement over our previous work, reaching a Matthews correlation coefficient of 0.44. EASE-MM correctly classified 73% of stabilising and 75% of destabilising protein variants. Using an independent test set of 238 mutations, we confirmed our results in a comparison with related work.

Conclusions: EASE-MM not only outperformed other related methods but also achieved more balanced results for different types of mutations based on the accessible surface area, secondary structure, or magnitude of stability changes. This can be attributed to using multiple models with the most relevant features selected for the given type of mutation. Therefore, our results support the presumption that different interactions govern stability changes in exposed and buried residues, or in residues with a different secondary structure.
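
The two-model consensus described in the Results can be illustrated with a minimal Python sketch. It is not the EASE-MM implementation: the abstract does not say how the consensus is formed, so averaging the two model scores is an assumption, as are the 0.25 relative accessible surface area cut-off, the sign convention (positive score means stabilising), and the names `select_models` and `classify`.

```python
from typing import Callable, Dict, Sequence, Tuple

# Hypothetical model type: maps a feature vector to a signed score.
# Assumed convention: positive = stabilising, negative = destabilising.
Model = Callable[[Sequence[float]], float]

# Assumed cut-off on predicted relative accessible surface area (0..1)
# separating buried from exposed residues; not taken from the paper.
ASA_THRESHOLD = 0.25

SS_CLASSES = ("helix", "sheet", "coil")


def select_models(
    models: Dict[str, Model], predicted_rsa: float, predicted_ss: str
) -> Tuple[Model, Model]:
    """Pick one model by predicted surface exposure and one by predicted secondary structure."""
    if predicted_ss not in SS_CLASSES:
        raise ValueError(f"unknown secondary structure class: {predicted_ss!r}")
    asa_key = "exposed" if predicted_rsa >= ASA_THRESHOLD else "buried"
    return models[asa_key], models[predicted_ss]


def classify(
    models: Dict[str, Model],
    features: Sequence[float],
    predicted_rsa: float,
    predicted_ss: str,
) -> str:
    """Combine the two selected models; averaging the scores is an assumption."""
    asa_model, ss_model = select_models(models, predicted_rsa, predicted_ss)
    score = 0.5 * (asa_model(features) + ss_model(features))
    return "stabilising" if score > 0 else "destabilising"
```

Here `models` would hold the five trained classifiers under the keys `exposed`, `buried`, `helix`, `sheet`, and `coil`; any callable that returns a signed score fits this sketch.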
