Overview of BioCreative II gene mention recognition

Nineteen teams presented results for the Gene Mention Task at the BioCreative II Workshop. In this task, participants designed systems to identify substrings in sentences that correspond to gene name mentions. The teams used a variety of methods, and results varied, with a highest achieved F1 score of 0.8721. Here we present brief descriptions of all the methods used and a statistical analysis of the results. We also demonstrate that, by combining the results from all submissions, an F score of 0.9066 is feasible, and furthermore that the best combined result makes use of the lowest scoring submissions.
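For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how an F1 score can be computed for mention-level predictions. It assumes exact-match scoring, with each mention represented as a hypothetical `(sentence_id, start, end)` tuple. The function name and data layout are illustrative assumptions, not the official BioCreative II evaluation procedure, which may treat boundary variants differently.

```python
def f1_score(gold, predicted):
    """Exact-match F1 over sets of (sentence_id, start, end) mentions.

    Simplifying assumption: a prediction counts as correct only if it
    matches a gold mention exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy data (invented for illustration only).
gold_mentions = {(0, 0, 4), (0, 10, 15), (1, 3, 8)}
predicted_mentions = {(0, 0, 4), (1, 3, 8), (1, 20, 25), (2, 0, 3)}

print(round(f1_score(gold_mentions, predicted_mentions), 4))
```

In this toy case two of the four predictions are correct and two of the three gold mentions are found, so precision is 1/2, recall is 2/3, and F1 is 4/7.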
