Overview of BioCreAtIvE: critical assessment of information extraction for biology

BackgroundThe goal of the first BioCreAtIvE challenge (Critical Assessment of Information Extraction in Biology) was to provide a set of common evaluation tasks to assess the state of the art for text mining applied to biological problems. The results were presented in a workshop held in Granada, Spain March 28–31, 2004. The articles collected in this BMC Bioinformatics supplement entitled "A critical assessment of text mining methods in molecular biology" describe the BioCreAtIvE tasks, systems, results and their independent evaluation.ResultsBioCreAtIvE focused on two tasks. The first dealt with extraction of gene or protein names from text, and their mapping into standardized gene identifiers for three model organism databases (fly, mouse, yeast). The second task addressed issues of functional annotation, requiring systems to identify specific text passages that supported Gene Ontology annotations for specific proteins, given full text articles.ConclusionThe first BioCreAtIvE assessment achieved a high level of international participation (27 groups from 10 countries). The assessment provided state-of-the-art performance results for a basic task (gene name finding and normalization), where the best systems achieved a balanced 80% precision / recall or better, which potentially makes them suitable for real applications in biology. The results for the advanced task (functional annotation from free text) were significantly lower, demonstrating the current limitations of text-mining approaches where knowledge extrapolation and interpretation are required. In addition, an important contribution of BioCreAtIvE has been the creation and release of training and test data sets for both tasks. There are 22 articles in this special issue, including six that provide analyses of results or data quality for the data sets, including a novel inter-annotator consistency assessment for the test set used in task 2.

[1]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[2]  C. Ball,et al.  Saccharomyces Genome Database. , 2002, Methods in enzymology.

[3]  Limsoon Wong,et al.  Accomplishments and challenges in literature data mining for biology , 2002, Bioinform..

[4]  Alexander A. Morgan,et al.  Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup , 2003, ISMB.

[5]  Marti A. Hearst,et al.  TREC 2007 Genomics Track Overview , 2007, TREC.

[6]  K. Bretonnel Cohen,et al.  BioCreAtIvE Task1A: entity identification with a stochastic tagger , 2005, BMC Bioinformatics.

[7]  Goran Nenadic,et al.  Mining protein function from text using term-based support vector machines , 2005, BMC Bioinformatics.

[8]  Alfonso Valencia,et al.  Evaluation of BioCreAtIvE assessment of task 2 , 2005, BMC Bioinformatics.

[9]  Ralf Zimmer,et al.  A simple approach for protein name identification: prospects and limits , 2005, BMC Bioinformatics.

[10]  Alexander A. Morgan,et al.  Data preparation and interannotator agreement: BioCreAtIvE Task 1B , 2005, BMC Bioinformatics.

[11]  Daniel Hanisch,et al.  ProMiner: rule-based protein and gene entity recognition , 2005, BMC Bioinformatics.

[12]  Karin M. Verspoor,et al.  Protein annotation as term categorization in the gene ontology using word proximity networks , 2005, BMC Bioinformatics.

[13]  Emily Dimmer,et al.  An evaluation of GO annotation retrieval for BioCreAtIvE and GOA , 2005, BMC Bioinformatics.

[14]  Malvina Nissim,et al.  Exploring the boundaries: gene and protein identification in biomedical text , 2005, BMC Bioinformatics.

[15]  Ulf Leser,et al.  Systematic feature evaluation for gene name recognition , 2005, BMC Bioinformatics.

[16]  Fernando Pereira,et al.  Identifying gene and protein mentions in text using conditional random fields , 2005, BMC Bioinformatics.

[17]  Jian Su,et al.  Recognition of protein/gene names from text using an ensemble of classifiers , 2005, BMC Bioinformatics.

[18]  Lorraine K. Tanabe,et al.  GENETAG: a tagged corpus for gene/protein named entity recognition , 2005, BMC Bioinformatics.

[19]  Masaki Murata,et al.  Gene/protein name recognition based on support vector machine using dictionary as features , 2005, BMC Bioinformatics.

[20]  Fernando Pereira,et al.  Automatically annotating documents with normalized gene lists , 2005, BMC Bioinformatics.

[21]  Alexander A. Morgan,et al.  BioCreAtIvE Task 1A: gene mention finding evaluation , 2005, BMC Bioinformatics.

[22]  Mark Craven,et al.  Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text , 2005, BMC Bioinformatics.

[23]  Javier Tamames,et al.  Text Detective: a rule-based system for gene annotation in biomedical texts , 2005, BMC Bioinformatics.

[24]  Alfonso Valencia,et al.  A sentence sliding window approach to extract protein annotations from biomedical articles , 2005, BMC Bioinformatics.

[25]  Alexander A. Morgan,et al.  Overview of BioCreAtIvE task 1B: normalized gene lists , 2005, BMC Bioinformatics.

[26]  Patrick Ruch,et al.  Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot , 2005, BMC Bioinformatics.

[27]  Mário J. Silva,et al.  Finding genomic ontology terms in text using evidence content , 2005, BMC Bioinformatics.