Genome‐wide enzyme annotation with precision control: Catalytic families (CatFam) databases

In this article, we present a new method termed CatFam (Catalytic Families) to automatically infer the functions of catalytic proteins, which account for 20–40% of all proteins in living organisms and play a critical role in a variety of biological processes. CatFam is a sequence‐based method that generates sequence profiles to represent and infer protein catalytic functions. CatFam generates profiles through a stepwise procedure that carefully controls profile quality and employs nonenzymes as negative samples to establish profile‐specific thresholds associated with a predefined nominal false‐positive rate (FPR) of predictions. The adjustable FPR allows for fine precision control of each profile and enables the generation of profile databases that meet different needs: function annotation with high precision and hypothesis generation with moderate precision but better recall. Multiple tests of CatFam databases (generated with distinct nominal FPRs) against enzyme and nonenzyme datasets show that the method's predictions have consistently high precision and recall. For example, a 1% FPR database predicts protein catalytic functions for a dataset of enzymes and nonenzymes with 98.6% precision and 95.0% recall. Comparisons of CatFam databases against other established profile‐based methods for the functional annotation of 13 bacterial genomes indicate that CatFam consistently achieves higher precision and (in most cases) higher recall, and that (on average) CatFam provides 21.9% additional catalytic functions not inferred by the other similarly reliable methods. These results strongly suggest that the proposed method provides a valuable contribution to the automated prediction of protein catalytic functions. The CatFam databases and the database search program are freely available at http://www.bhsai.org/downloads/catfam.tar.gz. Proteins 2009. © 2008 Wiley‐Liss, Inc.
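The idea of a profile-specific threshold tied to a nominal FPR can be illustrated with a minimal sketch. It is not the CatFam implementation, whose details the abstract does not give. It assumes that each profile produces a numeric score per sequence, that higher scores mean better matches, and that a sequence is a hit when its score is strictly above the threshold. The function names `threshold_for_fpr` and `precision_recall` and all scores are hypothetical.

```python
import math


def threshold_for_fpr(negative_scores, nominal_fpr):
    """Pick a score threshold so that at most `nominal_fpr` of the
    negative (nonenzyme) samples score strictly above it.

    Assumption: higher score = stronger match; a hit is score > threshold.
    """
    if not negative_scores:
        raise ValueError("need at least one negative sample")
    if not 0.0 <= nominal_fpr <= 1.0:
        raise ValueError("nominal_fpr must be in [0, 1]")
    n = len(negative_scores)
    # Number of negatives allowed above the threshold (small epsilon
    # guards against floating-point round-down, e.g. 0.29 * 100).
    allowed = math.floor(nominal_fpr * n + 1e-9)
    if allowed >= n:
        return float("-inf")  # every negative may pass
    ranked = sorted(negative_scores, reverse=True)
    # At most `allowed` negatives are strictly greater than ranked[allowed].
    return ranked[allowed]


def precision_recall(positive_scores, negative_scores, threshold):
    """Precision and recall of the rule `score > threshold`."""
    tp = sum(s > threshold for s in positive_scores)
    fp = sum(s > threshold for s in negative_scores)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / len(positive_scores) if positive_scores else float("nan")
    return precision, recall


# Hypothetical scores for one profile.
negatives = list(range(100))          # nonenzyme scores 0..99
positives = [90, 96, 97, 120, 150]    # enzyme scores

thr = threshold_for_fpr(negatives, 0.05)
precision, recall = precision_recall(positives, negatives, thr)
print(thr, precision, recall)
```

Lowering the nominal FPR raises the threshold, which trades recall for precision. This is the same trade-off the abstract describes between high-precision annotation databases and moderate-precision, higher-recall databases for hypothesis generation.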
