Curation accuracy of model organism databases

Manual extraction of information from the biomedical literature, or biocuration, is the central methodology used to construct many biological databases. For example, the UniProt protein database, the EcoCyc Escherichia coli database and the Candida Genome Database (CGD) are all based on biocuration. Life science researchers use biological databases extensively: as online encyclopedias, as aids in interpreting new experimental data and as gold standards for developing new bioinformatics algorithms. Although manual curation has been assumed to be highly accurate, we are aware of only one previous study of biocuration accuracy. We assessed the accuracy of EcoCyc and CGD by manually selecting curated assertions within randomly chosen EcoCyc and CGD gene pages and then checking whether the data in the referenced publications supported those assertions. A database assertion was considered to be in error if it could not be found in the publication cited for it. We identified 10 errors in the 633 facts that we validated across the two databases, for an overall error rate of 1.58%, and individual error rates of 1.82% for CGD and 1.40% for EcoCyc. These data suggest that manual curation of the experimental literature by PhD-level scientists is highly accurate.

Database URL: http://ecocyc.org/, http://www.candidagenome.org/
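The overall error rate follows directly from the counts reported above (10 errors in 633 validated facts). The sketch below reproduces that arithmetic and, as an illustration only, attaches a 95% Wilson score interval to it. The interval is not part of the reported results, and the function names `error_rate` and `wilson_interval` are hypothetical helpers introduced here.

```python
import math


def error_rate(errors: int, total: int) -> float:
    """Fraction of validated assertions that were found to be in error."""
    return errors / total


def wilson_interval(errors: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95% confidence level.
    This interval is an illustrative addition, not a figure from the study.
    """
    p = errors / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Counts taken from the abstract: 10 errors among 633 validated facts.
    errors, total = 10, 633
    rate = error_rate(errors, total)
    low, high = wilson_interval(errors, total)
    print(f"overall error rate: {rate:.2%}")
    print(f"95% Wilson interval: {low:.2%} to {high:.2%}")
```

The Wilson interval is used here rather than the simpler normal approximation because it tends to behave better when the observed proportion is close to zero, as it is for error counts this small.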
