Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The levels and types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families, drawn from six superfamilies, for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families). The two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation, averaging 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation increased from 1993 to 2005. The types of misannotation found fall into several categories, most of them associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.
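
The abstract reports misannotation levels as percentages per family but does not say how they are computed. As a minimal sketch, assume the level for a family is the share of sequences annotated to that family whose annotation was judged incorrect. The function name, the record layout, and the family labels below are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def misannotation_levels(records):
    """Return {family: percent of sequences judged misannotated}.

    `records` is an iterable of (family, is_misannotated) pairs.
    This layout is an assumption made for illustration only.
    """
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for family, is_misannotated in records:
        totals[family] += 1
        if is_misannotated:
            wrong[family] += 1
    return {family: 100.0 * wrong[family] / totals[family] for family in totals}


# Toy data with made-up labels; these are not results from the paper.
records = [
    ("family_A", False), ("family_A", True),
    ("family_A", False), ("family_A", False),
    ("family_B", True), ("family_B", True),
]

levels = misannotation_levels(records)

# Families above the 80% threshold mentioned in the abstract.
high = sorted(family for family, pct in levels.items() if pct > 80)
```

Under this assumed definition, the per-family percentages could then be averaged within each superfamily to give figures comparable to the ranges quoted above.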
