Enzyme family classification by support vector machines

One approach to facilitating protein function prediction is to classify proteins into functional families. Recent studies on the classification of G‐protein coupled receptors and other proteins suggest that a statistical learning method, support vector machines (SVM), may be useful for classifying proteins into functional families. In this work, SVM is applied to and tested on the classification of enzymes into functional families defined by the Enzyme Nomenclature Committee of IUBMB. The SVM classification system for each family is trained on representative enzymes of that family and on seed proteins of Pfam curated protein families. The classification accuracy for enzymes from 46 families is in the range of 50.0% to 95.7%, and that for non‐enzymes is in the range of 79.0% to 100%. The corresponding Matthews correlation coefficient is in the range of 54.1% to 96.1%. Moreover, 80.3% of the 8,291 correctly classified enzymes are uniquely classified into a specific enzyme family by using a scoring function, indicating that SVM may have a certain level of unique prediction capability. Testing results also suggest that SVM is, in some cases, capable of classifying distantly related enzymes and homologous enzymes of different functions. Efforts are under way to use a more comprehensive set of enzymes as training sets and to incorporate multi‐class SVM classification systems to further enhance the unique prediction accuracy. Our results suggest the potential of SVM for enzyme family classification and for facilitating protein function prediction. Our software is accessible at http://jing.cz3.nus.edu.sg/cgi‐bin/svmprot.cgi. Proteins 2004. © 2004 Wiley‐Liss, Inc.
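
The three figures reported above (accuracy on enzymes, accuracy on non‐enzymes, and the Matthews correlation coefficient) can all be derived from the confusion-matrix counts of one family's binary classifier. The sketch below uses the standard textbook definitions; this section does not state the exact formulas used in the work, so treating the enzyme accuracy as sensitivity and the non‐enzyme accuracy as specificity is an assumption. The function names and the counts in the example are hypothetical and are not results from the study.

```python
import math


def sensitivity(tp: int, fn: int) -> float:
    """Fraction of family members (positives) that are correctly recognized."""
    total = tp + fn
    return tp / total if total else 0.0


def specificity(tn: int, fp: int) -> float:
    """Fraction of non-members (negatives) that are correctly rejected."""
    total = tn + fp
    return tn / total if total else 0.0


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Standard Matthews correlation coefficient for a binary classifier.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for one enzyme family (illustration only).
    tp, fn, tn, fp = 90, 10, 950, 50
    print(f"enzyme accuracy (sensitivity):     {sensitivity(tp, fn):.1%}")
    print(f"non-enzyme accuracy (specificity): {specificity(tn, fp):.1%}")
    print(f"Matthews correlation coefficient:  {matthews_corrcoef(tp, tn, fp, fn):.1%}")
```

Unlike the two per-class accuracies, the Matthews correlation coefficient combines all four counts into a single value, which makes it informative when a family has far fewer members than non-members.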
