Enhancing medical named entity recognition with an extended segment representation technique

OBJECTIVE This paper formulates an extended segment representation (SR) technique to enhance named entity recognition (NER) in medical applications.

METHODS An extension to the IOBES (Inside/Outside/Begin/End/Single) SR technique is formulated. In the proposed extension, a new class is assigned to words that do not belong to a named entity (NE) in a given context but appear as part of an NE in other contexts. Such ambiguity can degrade the results of classification-based NER techniques. Assigning a separate class to these potentially ambiguous words allows a classifier to detect NEs more accurately, thereby increasing classification accuracy.

RESULTS The proposed SR technique is evaluated on the i2b2 2010 medical challenge data set with eight different classifiers. Each classifier is trained separately to extract three types of medical NE: treatment, problem, and test. Across the three experiments, the extended SR technique improves the average F1-measure for seven of the eight classifiers. The kNN classifier shows an average reduction of 0.18% across the three experiments, while the C4.5 classifier records an average improvement of 9.33%.
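
The tagging idea described under METHODS can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `to_iobes` and `extend_tags`, the extra tag symbol `"A"`, the lowercase word matching, and the toy sentences are all assumptions made for illustration, since the abstract does not specify them. The sketch first assigns plain IOBES tags from entity spans, then relabels any outside-entity word that occurs inside an entity elsewhere in the same corpus.

```python
from typing import List, Tuple


def to_iobes(n_tokens: int, spans: List[Tuple[int, int]]) -> List[str]:
    """Assign IOBES tags given entity spans as (start, end), end exclusive."""
    tags = ["O"] * n_tokens
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"          # single-token entity
        else:
            tags[start] = "B"          # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"          # interior tokens
            tags[end - 1] = "E"        # last token of the entity
    return tags


def extend_tags(
    sentences: List[List[str]],
    all_tags: List[List[str]],
    ambiguous_tag: str = "A",          # hypothetical symbol for the new class
) -> List[List[str]]:
    """Relabel 'O' tokens that appear inside an entity elsewhere in the corpus."""
    # Vocabulary of words seen with any entity tag (assumption: lowercase match).
    entity_vocab = {
        tok.lower()
        for toks, tags in zip(sentences, all_tags)
        for tok, tag in zip(toks, tags)
        if tag != "O"
    }
    return [
        [
            ambiguous_tag if tag == "O" and tok.lower() in entity_vocab else tag
            for tok, tag in zip(toks, tags)
        ]
        for toks, tags in zip(sentences, all_tags)
    ]


# Toy data for a "treatment" tagger (illustrative, not taken from i2b2).
s1 = ["started", "on", "iron", "supplements"]   # "iron supplements" is a treatment
s2 = ["serum", "iron", "was", "low"]            # no treatment in this sentence
tags1 = to_iobes(len(s1), [(2, 4)])
tags2 = to_iobes(len(s2), [])
extended = extend_tags([s1, s2], [tags1, tags2])
for toks, tags in zip([s1, s2], extended):
    print(list(zip(toks, tags)))
```

In this toy corpus, "iron" in the second sentence is outside any treatment entity but occurs inside one in the first sentence, so it receives the extra class instead of `O`. In practice the entity vocabulary would presumably be built from training data only; that choice is also an assumption of this sketch.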
