Differentially Private Synthetic Medical Data Generation using Convolutional GANs

Deep learning models have demonstrated superior performance in several application problems, such as image classification and speech processing. However, creating a deep learning model using health record data requires addressing certain privacy challenges that bring unique concerns to researchers working in this domain. One effective way to handle such private data issues is to generate realistic synthetic data that can provide practically acceptable data quality and correspondingly the model performance. To tackle this challenge, we develop a differentially private framework for synthetic data generation using Rényi differential privacy. Our approach builds on convolutional autoencoders and convolutional generative adversarial networks to preserve some of the critical characteristics for the generated synthetic data. In addition, our model can also capture the temporal information and feature correlations that might be present in the original data. We demonstrate that our model outperforms existing state-of-the-art models under the same privacy budget using several publicly available benchmark medical datasets in both supervised and unsupervised settings.

[1]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[2]  C. Villani Optimal Transport: Old and New , 2008 .

[3]  A. Rényi On Measures of Entropy and Information , 1961 .

[4]  Tsuyoshi Murata,et al.  {m , 1934, ACML.

[5]  Leo Breiman,et al.  Random Forests , 2001, Machine Learning.

[6]  G.B. Moody,et al.  The impact of the MIT-BIH Arrhythmia Database , 2001, IEEE Engineering in Medicine and Biology Magazine.

[7]  Emiliano De Cristofaro,et al.  LOGAN: Membership Inference Attacks Against Generative Models , 2017, Proc. Priv. Enhancing Technol..

[8]  Ralf Bousseljot,et al.  Nutzung der EKG-Signaldatenbank CARDIODAT der PTB über das Internet , 2009 .

[9]  Junqiao Chen,et al.  The validity of synthetic clinical data: a validation study of a leading synthetic data generator (Synthea) using clinical quality measures , 2019, BMC Medical Informatics and Decision Making.

[10]  Fei Wang,et al.  Differentially Private Generative Adversarial Network , 2018, ArXiv.

[11]  Rachel Cummings,et al.  Differentially Private Mixed-Type Data Generation For Unsupervised Learning , 2019, ArXiv.

[12]  Sushil Jajodia,et al.  Data Synthesis based on Generative Adversarial Networks , 2018, Proc. VLDB Endow..

[13]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[14]  Claude Castelluccia,et al.  Differentially Private Mixture of Generative Neural Networks , 2019, IEEE Transactions on Knowledge and Data Engineering.

[15]  Edward A. Fox,et al.  CorGAN: Correlation-Capturing Convolutional Generative Adversarial Networks for Generating Synthetic Healthcare Records , 2020, FLAIRS Conference.

[16]  Peter Szolovits,et al.  MIMIC-III, a freely accessible critical care database , 2016, Scientific Data.

[17]  Ilya Mironov,et al.  Rényi Differential Privacy , 2017, 2017 IEEE 30th Computer Security Foundations Symposium (CSF).

[18]  J. R. Quinlan Induction of decision trees , 2004, Machine Learning.

[19]  Lei Xu,et al.  Modeling Tabular data using Conditional GAN , 2019, NeurIPS.

[20]  Aaron C. Courville,et al.  Improved Training of Wasserstein GANs , 2017, NIPS.

[21]  Léon Bottou,et al.  Wasserstein GAN , 2017, ArXiv.

[22]  Wanda Pratt,et al.  Creating synthetic patient data to support the design and evaluation of novel health information technology , 2019, J. Biomed. Informatics.

[23]  G. Crooks On Measures of Entropy and Information , 2015 .

[24]  Jian Sun,et al.  Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[25]  Ian Goodfellow,et al.  Deep Learning with Differential Privacy , 2016, CCS.

[26]  Mohammad Al-Rubaie,et al.  Privacy-Preserving Machine Learning: Threats and Solutions , 2018, IEEE Security & Privacy.

[27]  K Lehnertz,et al.  Indications of nonlinear deterministic and finite-dimensional structures in time series of brain electrical activity: dependence on recording region and brain state. , 2001, Physical review. E, Statistical, nonlinear, and soft matter physics.

[28]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[29]  Tianqi Chen,et al.  XGBoost: A Scalable Tree Boosting System , 2016, KDD.

[30]  Mihaela van der Schaar,et al.  PATE-GAN: Generating Synthetic Data with Differential Privacy Guarantees , 2018, ICLR.

[31]  Martín Abadi,et al.  Semi-supervised Knowledge Transfer for Deep Learning from Private Training Data , 2016, ICLR.

[32]  P. Cochat,et al.  Et al , 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[33]  Jimeng Sun,et al.  Generating Multi-label Discrete Patient Records using Generative Adversarial Networks , 2017, MLHC.

[34]  Chao-Lin Liu,et al.  Synthesizing electronic health records using improved generative adversarial networks , 2018, J. Am. Medical Informatics Assoc..

[35]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[36]  Guy N. Rothblum,et al.  Boosting and Differential Privacy , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[37]  Vitaly Shmatikov,et al.  Robust De-anonymization of Large Sparse Datasets , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[38]  Jaime S. Cardoso,et al.  Transfer Learning with Partial Observability Applied to Cervical Cancer Screening , 2017, IbPRIA.

[39]  Yan Li,et al.  A distributed ensemble approach for mining healthcare data under privacy constraints , 2016, Inf. Sci..

[40]  Sheng Yu,et al.  Generation of Synthetic Electronic Medical Record Text , 2018, 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).

[41]  Vitaly Shmatikov,et al.  Privacy-preserving deep learning , 2015, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[42]  Reihaneh Torkzadehmahani,et al.  DP-CGAN: Differentially Private Synthetic Data and Label Generation , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[43]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[44]  Yoshua Bengio,et al.  Boundary-Seeking Generative Adversarial Networks , 2017, ICLR 2017.

[45]  Li Zhang,et al.  Rényi Differential Privacy of the Sampled Gaussian Mechanism , 2019, ArXiv.

[46]  Bhavani M. Thuraisingham,et al.  Privacy Preserving Synthetic Data Release Using Deep Learning , 2018, ECML/PKDD.

[47]  Aaron Roth,et al.  The Algorithmic Foundations of Differential Privacy , 2014, Found. Trends Theor. Comput. Sci..

[48]  Soumith Chintala,et al.  Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks , 2015, ICLR.

[49]  Jeffrey M. Hausdorff,et al.  Physionet: Components of a New Research Resource for Complex Physiologic Signals". Circu-lation Vol , 2000 .

[50]  Zhiwei Steven Wu,et al.  Privacy-Preserving Generative Deep Neural Networks Support Clinical Data Sharing , 2017, bioRxiv.

[51]  Kilian Q. Weinberger,et al.  An empirical study on evaluation metrics of generative adversarial networks , 2018, ArXiv.

[52]  Mark Kramer,et al.  Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic electronic health care record , 2017, J. Am. Medical Informatics Assoc..

[53]  Úlfar Erlingsson,et al.  Scalable Private Learning with PATE , 2018, ICLR.

[54]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.