PON-tstab: Protein Variant Stability Predictor. Importance of Training Data Quality

Several methods have been developed to predict the effects of amino acid substitutions on protein stability. Benchmark datasets are essential for training and testing such methods, and they must meet numerous requirements, including that the data be representative of the investigated phenomenon. All available machine learning predictors of variant stability have been trained with ProTherm data. We noticed a number of issues with the contents, quality and relevance of the database: there were errors, but also features that had not been clearly communicated. Consequently, all machine learning variant stability predictors have been trained on biased and incorrect data. We obtained a corrected dataset and trained a random forest-based tool, PON-tstab, applicable to variants in any organism. Predictions are provided for three categories: stability decreasing, stability increasing and not affecting stability. Our results highlight the importance of benchmark quality, suitability and appropriateness.
