Privacy-preserving generative deep neural networks support clinical data sharing

Though it is widely recognized that data sharing enables faster scientific progress, the sensible need to protect participant privacy hampers this practice in medicine. We train deep neural networks that generate synthetic subjects closely resembling study participants. Using the SPRINT trial as an example, we show that machine-learning models built from simulated participants generalize to the original dataset. We incorporate differential privacy, which offers strong guarantees on the likelihood that a subject could be identified as a member of the trial. Investigators who have compiled a dataset can use our method to provide a freely accessible public version that enables other scientists to perform discovery-oriented analyses. Generated data can be released alongside analytical code to enable fully reproducible workflows, even when privacy is a concern. By addressing data sharing challenges, deep neural networks can facilitate the rigorous and reproducible investigation of clinical datasets. One Sentence Summary Deep neural networks can generate shareable biomedical data to allow reanalysis while preserving the privacy of study participants.

[1]  Jonathon Shlens,et al.  Conditional Image Synthesis with Auxiliary Classifier GANs , 2016, ICML.

[2]  Cynthia Dwork,et al.  Differential Privacy , 2006, ICALP.

[3]  Joydeep Ghosh,et al.  Perturbed Gibbs Samplers for Generating Large-Scale Privacy-Safe Synthetic Health Data , 2013, 2013 IEEE International Conference on Healthcare Informatics.

[4]  Stephen W Lagakos,et al.  Statistics in medicine--reporting of subgroup analyses in clinical trials. , 2007, The New England journal of medicine.

[5]  Ian Goodfellow,et al.  Deep Learning with Differential Privacy , 2016, CCS.

[6]  Thomas Steinke,et al.  Differential Privacy: A Primer for a Non-Technical Audience , 2018 .

[7]  Somesh Jha,et al.  Privacy in Pharmacogenetics: An End-to-End Case Study of Personalized Warfarin Dosing , 2014, USENIX Security Symposium.

[8]  Ben Poole,et al.  Categorical Reparameterization with Gumbel-Softmax , 2016, ICLR.

[9]  Bradley Malin,et al.  How (not) to protect genomic data privacy in a distributed network: using trail re-identification to evaluate and design anonymity protection systems , 2004, J. Biomed. Informatics.

[10]  Cynthia Dwork,et al.  Calibrating Noise to Sensitivity in Private Data Analysis , 2006, TCC.

[11]  Russ B Altman,et al.  Health-information altruists--a potentially critical resource. , 2005, The New England journal of medicine.

[12]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[13]  Peter Szolovits,et al.  MIMIC-III, a freely accessible critical care database , 2016, Scientific Data.

[14]  S. Nelson,et al.  Resolving Individuals Contributing Trace Amounts of DNA to Highly Complex Mixtures Using High-Density SNP Genotyping Microarrays , 2008, PLoS genetics.

[15]  Simson L. Garfinkel,et al.  De-Identification of Personal Information , 2015 .

[16]  Vitaly Shmatikov,et al.  Robust De-anonymization of Large Sparse Datasets , 2008, 2008 IEEE Symposium on Security and Privacy (sp 2008).

[17]  Matt J. Kusner,et al.  GANS for Sequences of Discrete Elements with the Gumbel-softmax Distribution , 2016, ArXiv.

[18]  Yifan Peng,et al.  Opportunities and obstacles for deep learning in biology and medicine , 2017 .

[19]  Jimeng Sun,et al.  Generating Multi-label Discrete Electronic Health Records using Generative Adversarial Networks , 2017, ArXiv.

[20]  Gunnar Rätsch,et al.  Real-valued (Medical) Time Series Generation with Recurrent Conditional GANs , 2017, ArXiv.

[21]  Toniann Pitassi,et al.  The reusable holdout: Preserving validity in adaptive data analysis , 2015, Science.

[22]  Brett K. Beaulieu-Jones,et al.  Reproducibility of computational workflows is automated using continuous analysis , 2017, Nature Biotechnology.

[23]  Latanya Sweeney,et al.  Identifying Participants in the Personal Genome Project by Name , 2013, ArXiv.

[24]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[25]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[26]  John M. Abowd,et al.  The U.S. Census Bureau Adopts Differential Privacy , 2018, KDD.

[27]  Bonnie Berger,et al.  Realizing privacy preserving genome-wide association studies , 2016, Bioinform..

[28]  Sanjay Basu,et al.  Benefit and harm of intensive blood pressure treatment: Derivation and validation of risk models using data from the SPRINT and ACCORD trials , 2017, PLoS medicine.

[29]  Aaron Roth,et al.  The Algorithmic Foundations of Differential Privacy , 2014, Found. Trends Theor. Comput. Sci..

[30]  Jackson T. Wright,et al.  A Randomized Trial of Intensive versus Standard Blood-Pressure Control. , 2016, The New England journal of medicine.

[31]  Somesh Jha,et al.  Model Inversion Attacks that Exploit Confidence Information and Basic Countermeasures , 2015, CCS.

[32]  Vitaly Shmatikov,et al.  Membership Inference Attacks Against Machine Learning Models , 2016, 2017 IEEE Symposium on Security and Privacy (SP).

[33]  Latanya Sweeney,et al.  k-Anonymity: A Model for Protecting Privacy , 2002, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[34]  Vitaly Shmatikov,et al.  Privacy-preserving deep learning , 2015, 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton).

[35]  Nicholas Chiu,et al.  Abstract P121: The Effect of Intensive Blood Pressure Targets on Mortality in Chronic Kidney Disease Patients: An Individual Patient Data Meta-Analysis of Randomized Controlled Trials , 2018, Hypertension.