Prediction of Enzyme Classification from Protein Sequence without the Use of Sequence Similarity

We describe a novel approach for predicting the function of a protein from its amino-acid sequence. Given features that can be computed directly from the amino-acid sequence (such as pI, molecular weight, and amino-acid composition), the technique allows us to answer questions such as: Is the protein an enzyme? If so, to which Enzyme Commission (EC) class does it belong? Our approach uses machine learning (ML) techniques to induce classifiers that predict the EC class of an enzyme from features extracted from its primary sequence. We report on a variety of experiments in which we explored three ML techniques in conjunction with training datasets derived from PDB and from Swiss-Prot. We also explored several different feature sets. Our method predicts the first EC number of an enzyme with 74% accuracy (thereby assigning the enzyme to one of six broad categories of enzyme function), and the second EC number with 68% accuracy (thereby assigning the enzyme to one of 57 subcategories of enzyme function). This technique could be a valuable complement to sequence-similarity searches and to pathway-analysis methods.
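
As an illustration of the kind of sequence-derived feature mentioned above, the following minimal sketch computes amino-acid composition as a 20-element vector of residue fractions. The function name `amino_acid_composition`, the alphabetical residue ordering, and the choice to skip non-standard residue codes are assumptions made for this example; they are not taken from the paper, which does not specify its feature-extraction code here.

```python
from collections import Counter

# The 20 standard amino acids, in alphabetical order of their one-letter codes.
# This ordering is an arbitrary choice for the sketch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper: non-standard codes (e.g. X, B, Z) are ignored,
    which is an assumption of this sketch rather than the paper's rule.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino-acid residues")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Example: a short made-up peptide, used only to show the output shape.
features = amino_acid_composition("MKVLAAGIVGL")
```

A vector like `features` could be concatenated with other sequence-derived values (such as pI and molecular weight) and passed to any supervised classifier; how the paper combines its features is not described in this abstract.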
