Acronym Disambiguation in Clinical Notes from Electronic Health Records

Objective: The use of electronic health record (EHR) systems has grown over the past decade, and with it the need to extract information from unstructured clinical narratives. Clinical notes, however, frequently contain acronyms with several potential senses (meanings), and traditional natural language processing (NLP) techniques cannot differentiate between these senses. In this study we introduce an unsupervised method for acronym disambiguation, the task of identifying the correct sense of an acronym in clinical EHR notes.

Methods: We developed an unsupervised ensemble machine learning algorithm (CASEml) that automatically classifies acronyms by leveraging semantic embeddings, visit-level text, and billing information. The algorithm was validated using note data from the Veterans Affairs hospital system to classify the meaning of three acronyms: RA, MS, and MI. We compared the performance of CASEml against another standard unsupervised method and against a baseline that selects the most frequent acronym sense. We additionally evaluated the effect of RA disambiguation on NLP-driven phenotyping of rheumatoid arthritis.

Results: CASEml achieved accuracies of 0.947, 0.911, and 0.706 for RA, MS, and MI, respectively. These accuracies were higher than the most-frequent-sense baseline and, on average, higher than a state-of-the-art unsupervised method. We also demonstrated that applying CASEml to medical notes improves the AUC of a phenotyping algorithm for rheumatoid arthritis.

Conclusion: CASEml is a novel method that accurately disambiguates acronyms in clinical notes and has advantages over commonly used supervised and unsupervised machine learning approaches. In addition, CASEml improves the performance of NLP tasks that rely on ambiguous acronyms, such as phenotyping.
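As an illustration of the most-frequent-sense baseline mentioned in the Methods, the following is a minimal sketch, not the study's implementation. It assumes a small set of sense-labelled acronym mentions is available; the function name, the sense labels for RA, and the toy data are all hypothetical.

```python
from collections import Counter


def most_frequent_sense_baseline(train_senses, test_senses):
    """Assign the most common training sense to every test mention.

    Returns the chosen sense and the resulting accuracy on the test labels.
    Illustrative only: the paper's exact baseline procedure is not given here.
    """
    majority_sense, _ = Counter(train_senses).most_common(1)[0]
    correct = sum(1 for sense in test_senses if sense == majority_sense)
    return majority_sense, correct / len(test_senses)


# Hypothetical toy labels for mentions of "RA" (not data from the study).
train = ["rheumatoid arthritis"] * 7 + ["right atrium"] * 2 + ["room air"]
test = [
    "rheumatoid arthritis",
    "rheumatoid arthritis",
    "room air",
    "right atrium",
    "rheumatoid arthritis",
]

sense, acc = most_frequent_sense_baseline(train, test)
print(f"baseline sense: {sense}, accuracy: {acc:.3f}")
```

A baseline like this ignores context entirely, so its accuracy equals the share of test mentions that carry the dominant sense. Any context-aware disambiguation method needs to beat that figure to be useful.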
