Functional Representation of Enzymes by Specific Peptides

Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool for this purpose, sequence motifs can also serve this goal. Here we develop a motif-based method that applies an unsupervised motif extraction algorithm (MEX) to all enzyme sequences and filters the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), each appearing on a single branch of the EC hierarchy. In contrast to previous motif-based methods, the new method does not require preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids and specify the functions of 93% of all enzymes, much higher than the 63% coverage provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and demonstrates added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active-site and binding-site amino acids, and they occur in 3-D pockets neighboring the active site in a highly statistically significant manner. These pockets are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations yields small subsets of SPs that still possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.
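
The EC-filtering step described above can be illustrated with a minimal sketch. It assumes that candidate motifs have already been extracted (by MEX or any other means) and that each enzyme sequence carries a single four-level EC number. The function names `ec_prefix` and `specific_peptides`, the toy sequences, and the rule of assigning each motif to the deepest EC level on which all of its carriers agree are illustrative assumptions, not the authors' implementation.

```python
def ec_prefix(ec, level):
    """Return the first `level` fields of an EC number, e.g. ('1.1.1.2', 3) -> '1.1.1'."""
    return ".".join(ec.split(".")[:level])


def specific_peptides(motifs, enzymes):
    """Map each motif to the deepest EC branch shared by every enzyme containing it.

    `enzymes` is a list of (sequence, ec_number) pairs. Motifs whose carriers
    disagree already at the first EC level, or that occur nowhere, are dropped.
    This is a hypothetical filter written for illustration only.
    """
    result = {}
    for motif in motifs:
        ecs = [ec for seq, ec in enzymes if motif in seq]
        if not ecs:
            continue
        branch = None
        for level in range(1, 5):
            prefixes = {ec_prefix(ec, level) for ec in ecs}
            if len(prefixes) != 1:
                break  # carriers diverge at this level
            branch = prefixes.pop()
        if branch is not None:
            result[motif] = branch
    return result


# Toy data (invented sequences, for illustration only).
enzymes = [
    ("MKTAYIAKQR", "1.1.1.1"),
    ("GGTAYIAKLL", "1.1.1.2"),
    ("PPLVNDWQQA", "2.7.1.1"),
    ("AALVNDWTAY", "2.7.1.1"),
]
motifs = ["TAYIAK", "LVNDW", "TAY"]
sps = specific_peptides(motifs, enzymes)
```

In this toy example, `TAYIAK` is assigned to the third-level branch `1.1.1`, `LVNDW` to the fourth-level branch `2.7.1.1`, and `TAY` is discarded because it occurs in enzymes from two different first-level classes.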
