Disease named entity recognition using semisupervised learning and conditional random fields

Information extraction is an important text-mining task that aims to extract prespecified types of information from large text collections and make them available in structured representations such as databases. In the biomedical domain, information extraction can help biologists make the most of their digital-literature archives. A large and growing body of biomedical literature contains rich information about biomedical substances, and extracting that knowledge requires a good named entity recognition technique. In this article, we combine conditional random fields (CRFs), a state-of-the-art sequence-labeling algorithm, with two semisupervised learning techniques, bootstrapping and feature sampling, to recognize disease names in biomedical literature. For each technique, two data-processing strategies were also analyzed: one processes the unlabeled data partitions sequentially, and the other processes them in a round-robin fashion. The experimental results showed the advantage of semisupervised learning techniques when labeled training data are limited. Specifically, CRFs with bootstrapping implemented in sequential fashion outperformed strictly supervised CRFs for disease name recognition. The project was supported by NIH/NLM Grant R33 LM07299–01, 2002–2005. © 2011 Wiley Periodicals, Inc.
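
As a rough illustration of the bootstrapping idea and the two partition-processing orders mentioned above, the following Python sketch may help. It is not the authors' implementation: the abstract does not give the details, so the `ToyTagger` class (a most-frequent-label stand-in for a CRF), the `sequential` and `round_robin` schedules, the confidence measure, and the `threshold` value are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


class ToyTagger:
    """Hypothetical stand-in for a CRF: most frequent label per token."""

    def fit(self, sentences):
        self.counts = defaultdict(Counter)
        for tokens, labels in sentences:
            for tok, lab in zip(tokens, labels):
                self.counts[tok.lower()][lab] += 1
        return self

    def predict(self, tokens):
        """Return (labels, confidence); confidence is the weakest token score."""
        labels, scores = [], []
        for tok in tokens:
            seen = self.counts.get(tok.lower())
            if seen:
                lab, n = seen.most_common(1)[0]
                labels.append(lab)
                scores.append(n / sum(seen.values()))
            else:
                labels.append("O")
                scores.append(0.0)  # unseen token: no confidence
        return labels, min(scores, default=0.0)


def sequential(n_parts, rounds):
    """Assumed reading: finish all rounds on one partition, then move on."""
    for p in range(n_parts):
        for _ in range(rounds):
            yield p


def round_robin(n_parts, rounds):
    """Assumed reading: one round on each partition in turn, then cycle."""
    for _ in range(rounds):
        for p in range(n_parts):
            yield p


def bootstrap(labeled, partitions, schedule, threshold=0.9):
    """Self-training: add confidently tagged sentences, then retrain."""
    train = list(labeled)
    pools = [list(p) for p in partitions]
    model = ToyTagger().fit(train)
    for p in schedule:
        remaining = []
        for tokens in pools[p]:
            labels, conf = model.predict(tokens)
            if conf >= threshold:
                train.append((tokens, labels))  # accept machine-made labels
            else:
                remaining.append(tokens)  # leave for a later round
        pools[p] = remaining
        model = ToyTagger().fit(train)
    return model, train


labeled = [(["patient", "has", "asthma"], ["O", "O", "B-DIS"])]
partitions = [[["patient", "has", "asthma"]], [["has", "diabetes"]]]
model, train = bootstrap(labeled, partitions, sequential(2, 1))
```

In this sketch the two strategies differ only in the order in which partition indices are visited, so swapping `sequential(2, 1)` for `round_robin(2, 1)` changes the schedule without touching the training loop. A real system would replace `ToyTagger` with a CRF and use its sequence probability as the confidence score.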
