EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders

Researchers require timely access to real-world longitudinal electronic health records (EHR) to develop, test, validate, and implement machine learning solutions that improve the quality and efficiency of healthcare. In contrast, health systems value deeply patient privacy and data security. De-identified EHRs do not adequately address the needs of health systems, as de-identified data are susceptible to re-identification and its volume is also limited. Synthetic EHRs offer a potential solution. In this paper, we propose EHR Variational Autoencoder (EVA) for synthesizing sequences of discrete EHR encounters (e.g., clinical visits) and encounter features (e.g., diagnoses, medications, procedures). We illustrate that EVA can produce realistic EHR sequences, account for individual differences among patients, and can be conditioned on specific disease conditions, thus enabling disease-specific studies. We design efficient, accurate inference algorithms by combining stochastic gradient Markov Chain Monte Carlo with amortized variational inference. We assess the utility of the methods on large real-world EHR repositories containing over 250, 000 patients. Our experiments, which include user studies with knowledgeable clinicians, indicate the generated EHR sequences are realistic. We confirmed the performance of predictive models trained on the synthetic data are similar with those trained on real EHRs. Additionally, our findings indicate that augmenting real data with synthetic EHRs results in the best predictive performance improving the best baseline by as much as 8% in top-20 recall. ACM Reference Format: Siddharth Biswal, Soumya Ghosh, Jon Duke, Bradley Malin, Walter Stewart, and Jimeng Sun. 2020. EVA: Generating Longitudinal Electronic Health Records Using Conditional Variational Autoencoders. In Proceedings of . ACM, New York, NY, USA, 11 pages. https://doi.org/10.475/123_4

[1]  Zhiwei Steven Wu,et al.  Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing , 2017, bioRxiv.

[2]  J. Henry,et al.  Adoption of Electronic Health Record Systems among U . S . Non-Federal Acute Care Hospitals : 2008-2015 , 2013 .

[3]  Walter F. Stewart,et al.  Doctor AI: Predicting Clinical Events via Recurrent Neural Networks , 2015, MLHC.

[4]  Jimeng Sun,et al.  Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization , 2014, KDD.

[5]  Casey S. Greene,et al.  Privacy-preserving generative deep neural networks support clinical data sharing , 2017 .

[6]  Frank D. Wood,et al.  Learning Disentangled Representations with Semi-Supervised Deep Generative Models , 2017, NIPS.

[7]  Zhiting Hu,et al.  Improved Variational Autoencoders for Text Modeling using Dilated Convolutions , 2017, ICML.

[8]  Erhardt Barth,et al.  A Hybrid Convolutional Variational Autoencoder for Text Generation , 2017, EMNLP.

[9]  Mark Kramer,et al.  Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record , 2017, J. Am. Medical Informatics Assoc..

[10]  Noah D. Goodman,et al.  Amortized Inference in Probabilistic Reasoning , 2014, CogSci.

[11]  Lantao Yu,et al.  SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient , 2016, AAAI.

[12]  Mike Wu,et al.  Multimodal Generative Models for Scalable Weakly-Supervised Learning , 2018, NeurIPS.

[13]  Anna L. Buczak,et al.  Data-driven approach for creating synthetic electronic medical records , 2010, BMC Medical Informatics Decis. Mak..

[14]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[15]  Samy Bengio,et al.  Generating Sentences from a Continuous Space , 2015, CoNLL.

[16]  Yoshua Bengio,et al.  A Recurrent Latent Variable Model for Sequential Data , 2015, NIPS.

[17]  Jimeng Sun,et al.  Generating Multi-label Discrete Patient Records using Generative Adversarial Networks , 2017, MLHC.

[18]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[19]  Lawrence Carin,et al.  Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks , 2015, AAAI.

[20]  Eric P. Xing,et al.  Toward Controlled Generation of Text , 2017, ICML.

[21]  Yu Zhang,et al.  Unsupervised Learning of Disentangled and Interpretable Representations from Sequential Data , 2017, NIPS.

[22]  Kudakwashe Dube,et al.  The ATEN Framework for Creating the Realistic Synthetic Electronic Health Record , 2018, HEALTHINF.

[23]  Phil Blunsom,et al.  Language as a Latent Variable: Discrete Generative Models for Sentence Compression , 2016, EMNLP.

[24]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[25]  Jimeng Sun,et al.  Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics , 2015, KDD.

[26]  Max Welling,et al.  Semi-supervised Learning with Deep Generative Models , 2014, NIPS.

[27]  Geoffrey E. Hinton,et al.  A Learning Algorithm for Boltzmann Machines , 1985, Cogn. Sci..

[28]  Katherine A. Heller,et al.  An Improved Multi-Output Gaussian Process RNN with Real-Time Validation for Early Sepsis Detection , 2017, MLHC.

[29]  Alán Aspuru-Guzik,et al.  Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules , 2016, ACS central science.

[30]  Bradley Malin,et al.  Anonymising and sharing individual patient data , 2015, BMJ : British Medical Journal.

[31]  Heiga Zen,et al.  WaveNet: A Generative Model for Raw Audio , 2016, SSW.

[32]  Xi Chen,et al.  PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications , 2017, ICLR.

[33]  Chao-Lin Liu,et al.  Synthesizing electronic health records using improved generative adversarial networks , 2018, J. Am. Medical Informatics Assoc..