Annotation Guidelines for Machine Learning-Based Named Entity Recognition in Microbiology

Recent challenges on applying machine learning to named-entity recognition in biology have triggered discussion of the manual annotation guidelines used to annotate the training corpora. Corpus annotators and challenge participants have identified some sources of potential inconsistency. We go one step further by proposing specific annotation guidelines for biology and evaluating their effect on the performance of machine learning methods. We show that a significant improvement can be achieved in this way, and that it is due neither to the feature set nor to the ML methods.
