Joint learning of representations of medical concepts and words from EHR data

There has been increasing interest in learning low-dimensional vector representations of medical concepts from electronic health records (EHRs). EHRs contain structured data such as diagnostic codes and laboratory tests, but they also contain unstructured clinical notes, which provide more nuanced detail on a patient's health status. In this work, we propose a method that jointly learns representations of medical concepts and words. In particular, we capture the relationship between medical codes and words with a novel learning scheme for the word2vec model. Our method exploits relationships between different parts of the EHR within the same visit and embeds both codes and words in the same continuous vector space. From these embeddings we derive clusters that reflect distinct disease and treatment patterns. In our experiments, we show qualitatively how our method of grouping words for a given diagnostic code compares with a topic modeling approach. We also test how well our representations predict the disease patterns of the next visit. The results show that our approach outperforms several common methods.
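The abstract does not spell out the learning scheme, so the following is only a minimal sketch of one way visit-level co-occurrence could be turned into skip-gram style training pairs, assuming each visit is a dictionary holding a list of codes and a list of note words. The function name `cross_modal_pairs`, the record layout, and the code tokens are illustrative assumptions, not the authors' implementation.

```python
from itertools import product


def cross_modal_pairs(visits):
    """Return (target, context) pairs linking each code to each note word
    that appears in the same visit, in both directions.

    Assumption: pairing every code with every word of the same visit is one
    plausible reading of "exploits relationships between different parts of
    EHRs in the same visit"; the paper may sample or weight pairs differently.
    """
    pairs = []
    for visit in visits:
        for code, word in product(visit["codes"], visit["words"]):
            pairs.append((code, word))  # code predicts word
            pairs.append((word, code))  # word predicts code
    return pairs


# Hypothetical toy visits; the tokens are placeholders for real vocabulary.
visits = [
    {"codes": ["ICD9_428.0"], "words": ["dyspnea", "edema"]},
    {"codes": ["ICD9_250.00", "ICD9_401.9"], "words": ["insulin"]},
]
pairs = cross_modal_pairs(visits)
```

Pairs of this form could then be fed to any skip-gram trainer with negative sampling, so that codes and words share a single vocabulary and land in the same vector space. No pair is formed across visits, which keeps the context window aligned with the visit boundary.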
