Evaluating Word Representation Features in Biomedical Named Entity Recognition Tasks

Biomedical Named Entity Recognition (BNER), which extracts important entities such as genes and proteins, is a crucial step of natural language processing in the biomedical domain. Various machine learning-based approaches have been applied to BNER tasks and showed good performance. In this paper, we systematically investigated three different types of word representation (WR) features for BNER, including clustering-based representation, distributional representation, and word embeddings. We selected one algorithm from each of the three types of WR features and applied them to the JNLPBA and BioCreAtIvE II BNER tasks. Our results showed that all the three WR algorithms were beneficial to machine learning-based BNER systems. Moreover, combining these different types of WR features further improved BNER performance, indicating that they are complementary to each other. By combining all the three types of WR features, the improvements in F-measure on the BioCreAtIvE II GM and JNLPBA corpora were 3.75% and 1.39%, respectively, when compared with the systems using baseline features. To the best of our knowledge, this is the first study to systematically evaluate the effect of three different types of WR features for BNER tasks.

[1]  Proux,et al.  Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. , 1998, Genome informatics. Workshop on Genome Informatics.

[2]  Yoshua Bengio,et al.  Neural Probabilistic Language Models , 2006 .

[3]  Anders Holst,et al.  Random indexing of text samples for latent semantic analysis , 2000 .

[4]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[5]  Andrew McCallum,et al.  Maximum Entropy Markov Models for Information Extraction and Segmentation , 2000, ICML.

[6]  John Shawe-Taylor,et al.  Canonical Correlation Analysis: An Overview with Application to Learning Methods , 2004, Neural Computation.

[7]  Hua Xu,et al.  Clinical entity recognition using structural support vector machines with rich features , 2012, DTMBIO '12.

[8]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[9]  Luo Si,et al.  Boosting performance of bio-entity recognition by combining results from multiple systems , 2005, BIOKDD.

[10]  Hua Xu,et al.  Recognizing clinical entities in hospital discharge summaries using Structural Support Vector Machines with word representation features , 2013, BMC Medical Informatics and Decision Making.

[11]  Koby Crammer,et al.  Penn/Umass/CHOP Biocreative II systems , 2007 .

[12]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[13]  Paolo Rosso,et al.  Conditional Random Fields vs. Hidden Markov Models in a biomedical Named Entity Recognition task , 2007 .

[14]  Robert L. Mercer,et al.  Class-Based n-gram Models of Natural Language , 1992, CL.

[15]  Feng Liu,et al.  Named Entity Recognition in Biomedical Literature: A Comparison of Support Vector Machines and Conditional Random Fields , 2007, ICEIS.

[16]  Richard Tzong-Han Tsai,et al.  Overview of BioCreative II gene mention recognition , 2008, Genome Biology.

[17]  Hua Xu,et al.  A hybrid system for temporal information extraction from clinical text , 2013, J. Am. Medical Informatics Assoc..

[18]  Ulf Leser,et al.  What makes a gene name? Named entity recognition in the biomedical literature , 2005, Briefings Bioinform..

[19]  Nigel Collier,et al.  Introduction to the Bio-entity Recognition Task at JNLPBA , 2004, NLPBA/BioNLP.

[20]  R. Gaizauskas,et al.  Term Recognition and Classification in Biological Science Journal Articles , 1998 .

[21]  Yoshua Bengio,et al.  Word Representations: A Simple and General Method for Semi-Supervised Learning , 2010, ACL.

[22]  U. M. Feyyad Data mining and knowledge discovery: making sense out of data , 1996 .

[23]  Nigel Collier,et al.  Automatic Term Identification and Classification in Biology Texts. , 1999 .

[24]  Yanjun Qi,et al.  Semi-supervised Bio-named Entity Recognition with Word-Codebook Learning , 2010, SDM.

[25]  Thomas Hofmann,et al.  Probabilistic Latent Semantic Analysis , 1999, UAI.

[26]  Curt Burgess,et al.  Producing high-dimensional semantic spaces from lexical co-occurrence , 1996 .

[27]  Yoshua Bengio,et al.  A Neural Probabilistic Language Model , 2003, J. Mach. Learn. Res..

[28]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[29]  Robert Ho Canonical Correlation Analysis , 2013 .

[30]  Rie Kubota Ando,et al.  BioCreative II Gene Mention Tagging System at IBM Watson , 2007 .

[31]  Malvina Nissim,et al.  Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web , 2004, NLPBA/BioNLP.

[32]  Lukás Burget,et al.  Recurrent neural network based language model , 2010, INTERSPEECH.

[33]  Shaojun Zhao,et al.  Named Entity Recognition in Biomedical Texts using an HMM Model , 2004, NLPBA/BioNLP.

[34]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[35]  Hongfang Liu,et al.  BioThesaurus: a web-based thesaurus of protein and gene names , 2006, Bioinform..

[36]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[37]  Burr Settles,et al.  Biomedical Named Entity Recognition using Conditional Random Fields and Rich Feature Sets , 2004, NLPBA/BioNLP.

[38]  Jason Weston,et al.  Natural Language Processing (Almost) from Scratch , 2011, J. Mach. Learn. Res..