Comparing High Dimensional Word Embeddings Trained on Medical Text to Bag-of-Words for Predicting Medical Codes

Word embeddings are a useful tool for extracting knowledge from the free-form text contained in electronic health records, but it has become commonplace to train such word embeddings on data that do not accurately reflect how language is used in a healthcare context. We use prediction of medical codes as an example application to compare the accuracy of word embeddings trained on health corpora to those trained on more general collections of text. We show that both an increase in embedding dimensionality and an increase in the volume of health-related training data improve prediction accuracy. We also present a comparison to the traditional bag-of-words feature representation, demonstrating that in many cases this conceptually simple method for representing text yields higher accuracy than word embeddings.
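To make the two feature representations concrete, the following is a minimal sketch rather than the pipeline used in the paper. It assumes a document is represented either by raw term counts (bag-of-words) or by the average of its word vectors; the abstract does not state how embeddings are aggregated, so the averaging step is an assumption. The function names, the two-dimensional toy vectors, and the example note are all hypothetical.

```python
from collections import Counter


def bag_of_words(tokens, vocabulary):
    """Return one count per vocabulary term (a sparse, high-dimensional view)."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def mean_embedding(tokens, embeddings, dim):
    """Average the vectors of all in-vocabulary tokens (assumed aggregation)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known tokens: fall back to a zero vector of the right size.
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical 2-dimensional embeddings; real ones have hundreds of dimensions.
toy_embeddings = {
    "chest": [1.0, 0.0],
    "pain": [0.0, 1.0],
    "acute": [1.0, 1.0],
}
vocabulary = sorted(toy_embeddings)  # ['acute', 'chest', 'pain']

# Hypothetical clinical note, tokenized on whitespace.
note = "acute chest pain and chest tightness".split()

bow_features = bag_of_words(note, vocabulary)
emb_features = mean_embedding(note, toy_embeddings, dim=2)

print(bow_features)  # [1, 2, 1]
print(emb_features)  # [0.75, 0.5]
```

Either feature vector could then be passed to a classifier that predicts medical codes. The bag-of-words vector grows with the vocabulary, whereas the embedding vector has a fixed length set by the embedding dimensionality.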
