Original article Text mining for the biocuration workflow

Molecular biology has become heavily dependent on biological knowledge encoded in expert-curated biological databases. As the volume of biological literature increases, biocurators need help in keeping up with the literature; (semi-)automated aids for biocuration would seem to be an ideal application for natural language processing and text mining. However, to date, there have been few documented successes for improving biocuration throughput using text mining. Our initial investigations took place for the workshop on ‘Text Mining for the BioCuration Workflow’ at the third International Biocuration Conference (Berlin, 2009). We interviewed biocurators to obtain workflows from eight biological databases. This initial study revealed high-level commonalities, including (i) selection of documents for curation; (ii) indexing of documents with biologically relevant entities (e.g. genes); and (iii) detailed curation of specific relations (e.g. interactions); however, the detailed workflows also showed many variabilities. Following the workshop, we conducted a survey of biocurators. The survey identified biocurator priorities, including the handling of full text indexed with biological entities and support for the identification and prioritization of documents for curation. It also indicated that two-thirds of the biocuration teams had experimented with text mining and almost half were using text mining at that time. Analysis of our interviews and survey provides a set of requirements for the integration of text mining into the biocuration workflow. These can guide the identification of common needs across curated databases and encourage joint experimentation involving biocurators, text mining developers and the larger biomedical research community.
