Application of an interpretable classification model on Early Folding Residues during protein folding

Background: Machine learning strategies are prominent tools for data analysis. In the life sciences especially, they have become increasingly important for handling the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance but also gain complexity, and they tend to neglect the interpretability and comprehensibility of the resulting models.

Results: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method. It provides comprehensive visualization capabilities not present in other classifiers, which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well suited for imbalanced classification problems, which are frequent in the life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR), which have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic curve of 76.6% was achieved, which is comparable to other state-of-the-art classifiers.

Conclusions: The application to EFR prediction demonstrates how easily interpretable classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR that were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements, and they participate in networks of hydrophobic residues. The visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results.
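The interpretability described above comes from the adaptive distance that GMLVQ learns. As a minimal sketch, not the Weka plug-in itself, the following Python code illustrates the standard GMLVQ distance d(x, w) = (x - w)^T Λ (x - w) with Λ = Ω^T Ω, nearest-prototype classification, and the reading of the diagonal of Λ as per-feature relevance. The function names, the two-feature toy prototypes, and the matrix Ω are hypothetical illustration values; in practice the prototypes and Ω are learned from training data.

```python
import numpy as np


def gmlvq_distance(x, w, omega):
    """Squared adaptive distance d(x, w) = (x - w)^T Omega^T Omega (x - w)."""
    diff = omega @ (x - w)
    return float(diff @ diff)


def classify(x, prototypes, labels, omega):
    """Assign x the label of the nearest prototype under the adaptive distance."""
    distances = [gmlvq_distance(x, w, omega) for w in prototypes]
    return labels[int(np.argmin(distances))]


def feature_relevances(omega):
    """Diagonal of Lambda = Omega^T Omega, normalised so the entries sum to 1."""
    lam = omega.T @ omega
    return np.diag(lam) / np.trace(lam)


# Hypothetical toy setup: two features, one prototype per class.
# Real prototypes and omega would be learned, not set by hand.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = ["non-EFR", "EFR"]
omega = np.array([[1.0, 0.0], [0.0, 0.1]])  # strongly down-weights the second feature

sample = np.array([0.9, 0.0])
print(classify(sample, prototypes, labels, omega))
print(feature_relevances(omega))
```

Under plain Euclidean distance the sample would be closer to the first prototype. Because the assumed Ω nearly ignores the second feature, the sample is assigned to the second prototype instead, and the relevance vector makes that weighting explicit. This is the kind of per-feature reading that the abstract refers to when it names the most influential features.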
