Synthesizing electronic health records using improved generative adversarial networks

Objective: The aim of this study was to generate synthetic electronic health records (EHRs) that are more realistic than those produced by the existing medical generative adversarial network (medGAN) method.

Materials and Methods: We modified medGAN to obtain 2 synthetic data generation models, designated medical Wasserstein GAN with gradient penalty (medWGAN) and medical boundary-seeking GAN (medBGAN), and compared the results obtained using the 3 models. We used 2 databases: MIMIC-III and the National Health Insurance Research Database (NHIRD), Taiwan. First, we trained the 3 models and used them to generate synthetic EHRs. We then analyzed and compared the models' performance using statistical methods (Kolmogorov-Smirnov test, dimension-wise probability for binary data, and dimension-wise average count for count data) and 2 machine learning tasks (association rule mining and prediction).

Results: Our analysis showed that the proposed models generate synthetic EHR data effectively. They outperformed medGAN in all cases, and among the 3 models, medBGAN performed the best.

Discussion: By generating realistic synthetic EHR data, the proposed models can support the medical industry and related research in providing better services. Moreover, they can help remove barriers such as limited access to EHR data and thus accelerate research on medical informatics.

Conclusion: The proposed models can adequately learn the data distribution of real EHRs and efficiently generate realistic synthetic EHRs. The results show the superiority of our models over the existing model.
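As an illustration of the dimension-wise statistics named in the methods, the following is a minimal sketch and not the authors' implementation. It assumes each dataset is a patient-by-code matrix stored as a list of rows. The function names `dimension_wise_mean` and `mean_absolute_gap` and the toy data are hypothetical, and the mean absolute gap is only one possible summary chosen for this example. For binary data the column mean is the dimension-wise probability; for count data it is the dimension-wise average count.

```python
from typing import List, Sequence


def dimension_wise_mean(records: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a patient-by-code matrix.

    Binary input gives the dimension-wise probability (share of patients
    with each code); count input gives the dimension-wise average count.
    """
    if not records:
        raise ValueError("records must contain at least one patient")
    n_patients = len(records)
    n_codes = len(records[0])
    totals = [0.0] * n_codes
    for row in records:
        if len(row) != n_codes:
            raise ValueError("all records must have the same number of codes")
        for j, value in enumerate(row):
            totals[j] += value
    return [total / n_patients for total in totals]


def mean_absolute_gap(real: Sequence[Sequence[float]],
                      synthetic: Sequence[Sequence[float]]) -> float:
    """Average absolute difference between real and synthetic column means.

    This summary is an assumption made for the sketch, not a metric taken
    from the paper.
    """
    p_real = dimension_wise_mean(real)
    p_syn = dimension_wise_mean(synthetic)
    if len(p_real) != len(p_syn):
        raise ValueError("real and synthetic data must share the same codes")
    return sum(abs(a - b) for a, b in zip(p_real, p_syn)) / len(p_real)


# Toy, made-up binary records: 4 patients, 3 diagnosis codes.
real = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 0, 1]]
synthetic = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [1, 0, 0]]

print(dimension_wise_mean(real))
print(dimension_wise_mean(synthetic))
print(mean_absolute_gap(real, synthetic))
```

A smaller gap means the synthetic data reproduces the per-code frequencies of the real data more closely. Plotting the real column means against the synthetic ones gives the same comparison visually.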
