Application of an interpretable classification model on Early Folding Residues during protein folding

Background: Machine learning strategies are prominent tools for data analysis. In the life sciences especially, they have become increasingly important for handling the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance but also gain complexity, and they tend to neglect the interpretability and comprehensibility of the resulting models.

Results: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method. It provides comprehensive visualization capabilities not present in other classifiers, which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well suited for imbalanced classification problems, which are frequent in the life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR), which have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic curve of 76.6% was achieved, which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/.

Conclusions: The application to EFR prediction demonstrates how an easily interpretable classification model can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR that were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements, and they participate in networks of hydrophobic residues. The visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results.
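To make the idea of a prototype-based classifier with interpretable feature relevances more concrete, the following is a minimal Python sketch of the classification step commonly described for GMLVQ: a sample is assigned the label of the nearest prototype under the adaptive distance d(x, w) = (x - w)^T Ω^T Ω (x - w), and the diagonal of Λ = Ω^T Ω is read as a per-feature relevance. It is not the code of the Weka plug-in. The function names, the two-feature toy matrix, and the prototypes are all hypothetical, and the training procedure that would learn Ω and the prototypes is omitted.

```python
# Hypothetical sketch of GMLVQ-style classification with fixed parameters.
# In practice, omega and the prototypes are learned during training.

def project(omega, v):
    """Apply the linear map omega (a list of rows) to the vector v."""
    return [sum(o * x for o, x in zip(row, v)) for row in omega]

def gmlvq_distance(omega, x, w):
    """Adaptive squared distance d(x, w) = (x - w)^T Omega^T Omega (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    return sum(p * p for p in project(omega, diff))

def classify(omega, prototypes, x):
    """Return the label of the prototype closest to x."""
    nearest = min(prototypes, key=lambda p: gmlvq_distance(omega, x, p[0]))
    return nearest[1]

def relevance_diagonal(omega):
    """Diagonal of Lambda = Omega^T Omega: one relevance value per feature."""
    n_features = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_features)]

# Toy values invented for illustration; they are not taken from the EFR model.
omega = [[1.0, 0.0],
         [0.0, 0.1]]
prototypes = [([0.0, 0.0], "non-EFR"),
              ([1.0, 5.0], "EFR")]

label = classify(omega, prototypes, [0.9, 0.5])
relevances = relevance_diagonal(omega)
```

In this toy setting the second feature is strongly down-weighted by Ω, so the sample is matched to the prototype it resembles in the first feature. Inspecting the relevance values in this way is one route by which a model of this kind can indicate which features drive a classification.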
