Text mining in the biocuration workflow: applications for literature curation at WormBase, dictyBase and TAIR

WormBase, dictyBase and The Arabidopsis Information Resource (TAIR) are model organism databases containing information about Caenorhabditis elegans and other nematodes, the social amoeba Dictyostelium discoideum and related Dictyostelids, and the flowering plant Arabidopsis thaliana, respectively. Each database curates multiple data types from the primary research literature. In this article, we describe the curation workflow at WormBase, with particular emphasis on our use of text-mining tools (BioCreative 2012, Workshop Track II). We then describe the application of a specific component of that workflow, Textpresso for Cellular Component Curation (CCC), to Gene Ontology (GO) curation at dictyBase and TAIR (BioCreative 2012, Workshop Track III). We find that, with organism-specific modifications, Textpresso can be used by dictyBase and TAIR to annotate gene products to GO's Cellular Component (CC) ontology.
