BC4GO: a full-text corpus for the BioCreative IV GO task

Gene function curation via Gene Ontology (GO) annotation is a common task among Model Organism Database groups. Owing to its manual nature, this task is considered one of the bottlenecks in literature curation. There have been many previous attempts at automatic identification of GO terms and supporting information from full text. However, few systems have delivered accuracy comparable to that of human curators. One recognized challenge in developing such systems is the lack of marked sentence-level evidence text that provides the basis for making GO annotations. We aim to create a corpus that includes the GO evidence text along with the three core elements of GO annotations: (i) a gene or gene product, (ii) a GO term and (iii) a GO evidence code. To ensure our results are consistent with real-life GO data, we recruited eight professional GO curators and asked them to follow their routine GO annotation protocols. Our annotators marked up more than 5000 text passages in 200 articles for 1356 distinct GO terms. For evidence sentence selection, the inter-annotator agreement (IAA) results are 9.3% (strict) and 42.7% (relaxed) in F1-measures. For GO term selection, the IAAs are 47% (strict) and 62.9% (hierarchical). Our corpus analysis further shows that abstracts contain ∼10% of relevant evidence sentences and 30% of distinct GO terms, while the Results/Experiment section has nearly 60% of relevant sentences and >70% of GO terms. Further, of those evidence sentences found in abstracts, less than one-third contain enough experimental detail to fulfill the three core criteria of a GO annotation. This result demonstrates the need to use full-text articles for text mining GO annotations. Through its use at the BioCreative IV GO (BC4GO) task, we expect our corpus to become a valuable resource for the BioNLP research community. Database URL: http://www.biocreative.org/resources/corpora/bc-iv-go-task-corpus/.
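As an illustration of how agreement figures of this kind can be computed, the sketch below scores two annotators against each other with an F1-measure under a strict and a relaxed matching key. The matching definitions are assumptions made for this example only: "strict" here requires the same document, sentence and gene, while "relaxed" requires only the same document and sentence. The record layout, the function name `f1_agreement` and the toy data are likewise hypothetical and are not taken from the corpus or its evaluation scripts.

```python
def f1_agreement(reference, other, key=lambda record: record):
    """F1 between two annotators' record lists, compared on `key(record)`.

    `reference` is treated as the gold side for recall and `other` as the
    predicted side for precision; F1 is symmetric, so swapping them gives
    the same score.
    """
    ref_keys = {key(r) for r in reference}
    other_keys = {key(r) for r in other}
    if not ref_keys or not other_keys:
        return 0.0
    overlap = len(ref_keys & other_keys)
    if overlap == 0:
        return 0.0
    precision = overlap / len(other_keys)
    recall = overlap / len(ref_keys)
    return 2 * precision * recall / (precision + recall)


# Hypothetical records: (document id, sentence index, gene id).
annotator_1 = [("d1", 1, "g1"), ("d1", 2, "g1"), ("d1", 5, "g2")]
annotator_2 = [("d1", 1, "g1"), ("d1", 2, "g3"), ("d1", 7, "g2")]

# Assumed "strict" key: the whole record must match.
strict_f1 = f1_agreement(annotator_1, annotator_2)

# Assumed "relaxed" key: only document and sentence must match.
relaxed_f1 = f1_agreement(annotator_1, annotator_2, key=lambda r: r[:2])
```

On this toy data the two annotators share one full record and two sentences, so the relaxed score is higher than the strict one, mirroring the direction of the gap reported for evidence sentence selection.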
