MedJEx: A Medical Jargon Extraction Model with Wiki’s Hyperlink Span and Contextualized Masked Language Model Score

This paper proposes a new natural language processing (NLP) application: identifying medical jargon terms in electronic health record (EHR) notes that patients may find difficult to comprehend. We first present a novel, publicly available dataset (MedJ) of expert-annotated medical jargon terms from 18K+ EHR note sentences. We then introduce a novel medical jargon extraction model (MedJEx), which has been shown to outperform existing state-of-the-art NLP models. First, MedJEx's overall performance improved when it was trained on an auxiliary Wikipedia hyperlink span dataset, in which each hyperlink span points to a Wikipedia article that explains the span (or term), and then fine-tuned on the annotated MedJ data. Second, we found that a contextualized masked language model score was beneficial for detecting domain-specific unfamiliar jargon terms. Moreover, our results show that training on the auxiliary Wikipedia hyperlink span dataset improved performance on six out of eight biomedical named entity recognition benchmark datasets. MedJEx is publicly available.
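To make the idea of a masked language model score concrete, here is a minimal sketch of one plausible formulation: mask each token of a candidate span in turn and average the negative log-probability the model assigns to the original token. This is an illustration under stated assumptions, not the MedJEx implementation. The names `masked_lm_score` and `toy_predict`, the smoothing floor, and the toy probability table are all hypothetical, and the stub predictor ignores context, whereas a real masked language model would condition on the surrounding tokens.

```python
import math
from typing import Callable, Dict, List, Tuple

MASK = "[MASK]"

# Hypothetical predictor type: given the masked token list and the masked
# position, return a probability for each candidate token at that position.
Predictor = Callable[[List[str], int], Dict[str, float]]


def masked_lm_score(tokens: List[str], span: Tuple[int, int], predict: Predictor) -> float:
    """Average negative log-probability of the span's tokens when each is masked.

    A higher score means the model finds the span harder to predict in context,
    which this sketch treats as a rough proxy for unfamiliarity.
    """
    start, end = span  # half-open interval [start, end)
    total = 0.0
    for i in range(start, end):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        probs = predict(masked, i)
        # Assumed smoothing floor for tokens the predictor does not know.
        p = probs.get(tokens[i], 1e-6)
        total += -math.log(p)
    return total / (end - start)


def toy_predict(masked_tokens: List[str], position: int) -> Dict[str, float]:
    """Stand-in for a real masked LM; returns a fixed, context-free table."""
    return {"the": 0.3, "patient": 0.2, "has": 0.2, "fever": 0.1, "pyrexia": 0.001}


if __name__ == "__main__":
    common = masked_lm_score(["the", "patient", "has", "fever"], (3, 4), toy_predict)
    rare = masked_lm_score(["the", "patient", "has", "pyrexia"], (3, 4), toy_predict)
    print(f"fever: {common:.3f}  pyrexia: {rare:.3f}")
```

In this toy setup the rarer term receives the higher score. Replacing `toy_predict` with the output distribution of a pretrained masked language model would make the score depend on context.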
