A study of machine-learning-based approaches to extract clinical entities and their assertions from discharge summaries

OBJECTIVE The authors' goal was to develop and evaluate machine-learning-based approaches to extracting clinical entities (medical problems, tests, and treatments), as well as their asserted status, from hospital discharge summaries written in natural language. This project was part of the 2010 Center of Informatics for Integrating Biology and the Bedside/Veterans Affairs (VA) natural-language-processing challenge.

DESIGN The authors implemented a machine-learning-based named entity recognition system for clinical text and systematically evaluated the contributions of different types of features and ML algorithms, using a training corpus of 349 annotated notes. Based on the results from the training data, the authors developed a novel hybrid clinical entity extraction system, which integrated heuristic rule-based modules with the ML-based named entity recognition module. The authors applied the hybrid system to the concept extraction and assertion classification tasks in the challenge and evaluated its performance on a test data set of 477 annotated notes.

MEASUREMENTS Standard measures, including precision, recall, and F-measure, were calculated using the evaluation script provided by the Center of Informatics for Integrating Biology and the Bedside/VA challenge organizers. The overall performance across all three types of clinical entities and all six types of assertions on the 477 annotated test notes was the primary metric in the challenge.

RESULTS AND DISCUSSION Systematic evaluation on the training set showed that Conditional Random Fields outperformed Support Vector Machines and that semantic information from existing natural-language-processing systems substantially improved performance, although the contributions of different types of features varied. The authors' hybrid entity extraction system achieved a maximum overall F-score of 0.8391 for concept extraction (ranked second) and 0.9313 for assertion classification (ranked fourth, but not statistically different from the first three systems) on the test data set in the challenge.
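
To make the measures named above concrete, the following is a minimal sketch of how precision, recall, and F-measure can be computed for entity extraction. It is not the organizers' evaluation script: it assumes exact-match scoring (an extracted entity counts as correct only if its note, span, and type all match a gold annotation) pooled over all notes, and the `Entity` type, the `precision_recall_f` function, and the sample data are hypothetical names and values chosen for illustration.

```python
from typing import NamedTuple, Set, Tuple


class Entity(NamedTuple):
    """Hypothetical representation of one annotated clinical entity."""
    note_id: str
    start: int   # token offset where the entity begins
    end: int     # token offset where the entity ends
    label: str   # e.g. "problem", "test", or "treatment"


def precision_recall_f(
    gold: Set[Entity], predicted: Set[Entity], beta: float = 1.0
) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F-measure (assumed scoring scheme)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Illustrative data only: four gold entities, three predictions, two exact matches.
gold = {
    Entity("note1", 3, 4, "problem"),
    Entity("note1", 10, 11, "test"),
    Entity("note1", 15, 17, "treatment"),
    Entity("note2", 0, 1, "problem"),
}
predicted = {
    Entity("note1", 3, 4, "problem"),      # exact match
    Entity("note1", 10, 11, "test"),       # exact match
    Entity("note1", 15, 16, "treatment"),  # boundary differs, so no match
}

p, r, f = precision_recall_f(gold, predicted)
print(f"precision={p:.4f} recall={r:.4f} F={f:.4f}")
```

With `beta = 1.0` the F-measure is the harmonic mean of precision and recall. A more lenient scheme that credits overlapping spans would change only how `true_positives` is counted.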
