DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe

Background: Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches that perform better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature–based enzyme functional prediction tool to assign Enzyme Commission (EC) digits.

Results: DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable EC number annotation tool: across multiple tests, its accuracy was estimated to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes.

Conclusions: Our results offer preliminary confirmation of the existence of the hypothesized huge number of “hidden enzymes” in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences.
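
To make the idea of a top-down, domain signature–based EC assignment more concrete, the following is a minimal illustrative sketch in Python. It is not the DomSign implementation: the abstract does not specify the algorithm, so the voting rule, the `threshold` parameter, the function name `predict_ec`, and the toy domain labels (`domA`, `domB`, `domC`) are all assumptions made for illustration. The sketch treats a protein's domain signature as its set of domains, looks up annotated proteins with the same signature, and extends the predicted EC number one digit at a time, stopping as soon as agreement among those proteins drops below the threshold.

```python
from collections import Counter


def predict_ec(signature, training, threshold=0.8):
    """Assign EC digits top-down for one domain signature (illustrative only).

    signature: frozenset of domain labels for the query protein.
    training:  list of (frozenset_of_domains, "a.b.c.d") pairs (hypothetical data).
    threshold: assumed minimum fraction of agreeing proteins needed to
               accept the next EC digit.
    """
    # Collect EC numbers of annotated proteins sharing the exact signature.
    ecs = [ec.split(".") for sig, ec in training if sig == signature]
    if not ecs:
        return None  # signature never seen among annotated enzymes

    prefix = []
    for level in range(4):
        # Only proteins consistent with the digits chosen so far may vote.
        candidates = [ec for ec in ecs if ec[:level] == prefix]
        votes = Counter(ec[level] for ec in candidates if ec[level] != "-")
        if not votes:
            break
        digit, count = votes.most_common(1)[0]
        # Stop descending when the majority is not clear enough.
        if count / len(candidates) < threshold:
            break
        prefix.append(digit)

    if not prefix:
        return None
    # Pad unresolved digits with "-" as in partial EC numbers.
    return ".".join(prefix + ["-"] * (4 - len(prefix)))


# Toy, made-up training data: domain labels and EC numbers are placeholders.
training = [
    (frozenset({"domA", "domB"}), "1.1.1.1"),
    (frozenset({"domA", "domB"}), "1.1.1.2"),
    (frozenset({"domA", "domB"}), "1.1.1.1"),
    (frozenset({"domC"}), "2.7.1.1"),
]

partial = predict_ec(frozenset({"domA", "domB"}), training, threshold=0.8)
full = predict_ec(frozenset({"domA", "domB"}), training, threshold=0.6)
unseen = predict_ec(frozenset({"domZ"}), training)
```

In this toy data the three `domA`/`domB` proteins agree on the first three digits but split 2:1 on the fourth, so a stricter threshold yields a partial EC number while a looser one commits to all four digits. This is the trade-off a top-down scheme exposes: a less specific but more reliable annotation instead of no annotation at all.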
