MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining

One of the biggest challenges prohibiting the use of many current NLP methods in clinical settings is the limited availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation and designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and faster convergence when fine-tuning on downstream medical tasks.
