Extracting Semantic Lexicons from Discharge Summaries using Machine Learning and the C-Value Method

Semantic lexicons that link words and phrases to specific semantic types, such as diseases, are valuable assets for clinical natural language processing (NLP) systems. Although terms with predefined semantic types can be generated easily from existing knowledge bases such as the Unified Medical Language System (UMLS), such term lists are often limited and do not cover narrative clinical text well. In this study, we developed a method for building semantic lexicons from a clinical corpus. It extracts candidate semantic terms using a conditional random field (CRF) classifier and then selects terms using the C-Value algorithm. We applied the method to a corpus containing 10 years of discharge summaries from Vanderbilt University Hospital (VUH) and extracted 44,957 new terms for three semantic groups: Problem, Treatment, and Test. A manual analysis of 200 randomly selected terms not found in the UMLS showed that 59% of them were meaningful new clinical concepts and 25% were lexical variants of existing concepts in the UMLS. Furthermore, we compared the effectiveness of corpus-derived and UMLS-derived semantic lexicons in the concept extraction task of the 2010 i2b2 clinical NLP challenge. The classifier with corpus-derived semantic lexicons as features achieved better performance (F-score 82.52%) than the classifier with UMLS-derived semantic lexicons as features (F-score 82.04%). We conclude that such corpus-based methods are effective for generating semantic lexicons, which may improve named entity recognition tasks and may aid in augmenting synonymy within existing terminologies.
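As an illustration of the term-selection step, the following is a minimal sketch of C-Value scoring in Python. It is not the authors' implementation: the function names `is_nested` and `c_values` and the toy frequencies are hypothetical, the candidates are assumed to have already been produced by an upstream extractor (the CRF classifier in this study), and the length weight uses log2(length + 1), a common variant that keeps single-word terms from scoring zero (the classic formulation uses log2(length)).

```python
import math


def is_nested(short, long):
    """Return True if token tuple `short` occurs contiguously inside the longer `long`."""
    n, m = len(short), len(long)
    return n < m and any(long[i:i + n] == short for i in range(m - n + 1))


def c_values(freq):
    """Score candidate terms with a simplified C-Value.

    `freq` maps each candidate term (a tuple of tokens) to its corpus frequency.
    """
    scores = {}
    for a, f_a in freq.items():
        # Frequencies of the longer candidates that contain `a` as a sub-phrase.
        containing = [f_b for b, f_b in freq.items() if is_nested(a, b)]
        # Assumption: log2(len + 1) variant, so one-word terms keep a nonzero weight.
        weight = math.log2(len(a) + 1)
        if containing:
            # Discount occurrences explained by longer terms that contain `a`.
            scores[a] = weight * (f_a - sum(containing) / len(containing))
        else:
            scores[a] = weight * f_a
    return scores


# Hypothetical candidate frequencies, for illustration only.
toy_freq = {
    ("pain",): 12,
    ("chest", "pain"): 10,
    ("severe", "chest", "pain"): 4,
}
scores = c_values(toy_freq)
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy example, "chest pain" ranks above both "severe chest pain" and "pain": it is frequent on its own, while most occurrences of "pain" are accounted for by the longer terms that contain it. A lexicon would then be built by keeping candidates above a chosen score threshold, which is a tuning decision this sketch leaves open.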
