SMpred: A Support Vector Machine Approach to Identify Structural Motifs in Protein Structure Without Using Evolutionary Information

Abstract Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the full details of its sequence, a small group of residues, often called structural motifs, plays a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. SMpred was trained and tested using 132 protein domains containing 581 motifs. SMpred achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for length-deviant superfamilies and single-member superfamilies. These results suggest the usefulness of our approach for facilitating the identification of structural motifs in protein structures in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.
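The accuracy, sensitivity and specificity quoted in the abstract are standard binary-classification measures. As a minimal sketch, assuming the usual confusion-matrix definitions (the abstract does not spell out its formulas), they can be computed as follows. The function name and the counts are hypothetical and are not taken from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) as percentages.

    Assumes the standard definitions and that both classes are non-empty:
      accuracy    = (TP + TN) / (TP + TN + FP + FN)
      sensitivity = TP / (TP + FN)   # fraction of true motif residues recovered
      specificity = TN / (TN + FP)   # fraction of non-motif residues rejected
    """
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    return accuracy, sensitivity, specificity


# Hypothetical counts for illustration only (not the SMpred results).
acc, sens, spec = classification_metrics(tp=80, tn=70, fp=30, fn=20)
print(f"accuracy={acc:.2f}% sensitivity={sens:.2f}% specificity={spec:.2f}%")
```

When the positive and negative classes are the same size, accuracy is the mean of sensitivity and specificity.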
