Assessing privacy and quality of synthetic health data

This paper builds on the results of the ESANN 2019 conference paper "Privacy Preserving Synthetic Health Data" [2], which develops metrics for assessing the privacy and utility of synthetic data and of the models that generate them. The metrics laid out in that paper show that synthetic data can retain utility while preserving the privacy of both the generative model and the data it was trained on. Specifically, we focused on the success of the Wasserstein GAN method [16], renamed HealthGAN, in comparison with other data generating methods. In this paper, we provide additional novel metrics that quantify the susceptibility of these generative models to membership inference attacks [8]. We also introduce Discriminator Testing, a new method for determining whether the different generators overfit the training data, which can result in privacy loss. These privacy issues are of high importance as we prepare a final workflow for generating synthetic data from real data in a secure environment. The results of these tests complement the initial ones: they show that the Parzen windows method, despite a low privacy loss under the adversarial accuracy metrics, fails to preserve privacy against the membership inference attack. Only HealthGAN achieves near-optimal values on both the privacy loss and the membership inference attack. Discriminator Testing adds further confidence, showing that HealthGAN resembles the training data without reproducing it.
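To make concrete why a density model that memorizes its training set can fail a membership inference test, here is a toy sketch. It is not the attack or the data from the paper: it assumes a Gaussian Parzen-window (kernel density) estimator on synthetic 2D Gaussian data, and a simple likelihood-threshold attacker. Because the Parzen density places a kernel directly on every training record, members score systematically higher than non-members, and the attacker separates them well above chance.

```python
import numpy as np

rng = np.random.default_rng(0)

def parzen_log_density(x, data, bandwidth=0.1):
    """Gaussian Parzen-window (KDE) log-density of point `x` under `data`."""
    d = data.shape[1]
    diffs = (data - x) / bandwidth
    log_kernels = (-0.5 * np.sum(diffs ** 2, axis=1)
                   - d * np.log(bandwidth * np.sqrt(2.0 * np.pi)))
    # log of the mean of the per-point Gaussian kernels
    return np.logaddexp.reduce(log_kernels) - np.log(len(data))

# Toy setup: members (train) and non-members (holdout), same distribution.
train = rng.normal(size=(50, 2))
holdout = rng.normal(size=(50, 2))

# The "model" is the Parzen density fit to train; the attacker scores
# candidate records by their likelihood under it.
member_scores = np.array([parzen_log_density(x, train) for x in train])
nonmember_scores = np.array([parzen_log_density(x, train) for x in holdout])

# Threshold attack: declare "member" if the score exceeds the median score.
threshold = np.median(np.concatenate([member_scores, nonmember_scores]))
attack_accuracy = 0.5 * ((member_scores > threshold).mean()
                         + (nonmember_scores <= threshold).mean())
# attack_accuracy well above 0.5 indicates a privacy leak
```

A generator that produces genuinely novel samples, rather than a density glued to the training rows, would push this attack accuracy back toward 0.5.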
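The adversarial accuracy privacy loss mentioned above can be illustrated with a small sketch. This is one plausible nearest-neighbor formulation, not necessarily the exact definition used in the paper: adversarial accuracy asks how often a 1-NN adversary can tell real from synthetic points (0.5 means indistinguishable), and privacy loss is the gap between the test-vs-synthetic and train-vs-synthetic scores. All data, sizes, and the `synth_copy` memorizing "generator" are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def nn_dist(a, b, leave_one_out=False):
    """For each row of `a`, the distance to its nearest neighbor in `b`."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    if leave_one_out:
        np.fill_diagonal(d, np.inf)  # exclude the point itself
    return d.min(axis=1)

def adversarial_accuracy(real, synth):
    """Fraction of points whose nearest neighbor lies in their own set.
    ~0.5 means a 1-NN adversary cannot separate the two sets."""
    d_rs = nn_dist(real, synth)
    d_rr = nn_dist(real, real, leave_one_out=True)
    d_sr = nn_dist(synth, real)
    d_ss = nn_dist(synth, synth, leave_one_out=True)
    return 0.5 * ((d_rs > d_rr).mean() + (d_sr > d_ss).mean())

train = rng.normal(size=(100, 2))
test = rng.normal(size=(100, 2))
synth_ok = rng.normal(size=(100, 2))                            # fresh samples
synth_copy = train + rng.normal(scale=0.01, size=train.shape)   # near-copies

# Privacy loss as the train/test adversarial-accuracy gap: near 0 is ideal,
# a large value means the generator sits suspiciously close to training rows.
loss_ok = (adversarial_accuracy(test, synth_ok)
           - adversarial_accuracy(train, synth_ok))
loss_copy = (adversarial_accuracy(test, synth_copy)
             - adversarial_accuracy(train, synth_copy))
```

The memorizing generator (`synth_copy`) shows a large gap because its samples are nearest neighbors of training rows but not of test rows, while the honest generator's gap stays near zero.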

[1] David G. Stork et al. Pattern Classification, 1973.

[2] Isabelle Guyon et al. Privacy Preserving Synthetic Health Data, 2019, ESANN.

[3] E. Parzen. On Estimation of a Probability Density Function and Mode, 1962.

[4] Kalyan Veeramachaneni et al. The Synthetic Data Vault, 2016, IEEE International Conference on Data Science and Advanced Analytics (DSAA).

[5] Mario Fritz et al. ML-Leaks: Model and Data Independent Membership Inference Attacks and Defenses on Machine Learning Models, 2018, NDSS.

[6] Emiliano De Cristofaro et al. LOGAN: Membership Inference Attacks Against Generative Models, 2017, Proc. Priv. Enhancing Technol.

[7] Fabian Prasser et al. A Tool for Optimizing De-identified Health Data for Use in Statistical Classification, 2017, IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS).

[8] Vitaly Shmatikov et al. Membership Inference Attacks Against Machine Learning Models, 2017, IEEE Symposium on Security and Privacy (SP).

[9] Cynthia Dwork et al. Differential Privacy: A Survey of Results, 2008, TAMC.

[10] Yoshua Bengio et al. Generative Adversarial Nets, 2014, NIPS.

[11] Shigeo Abe. Pattern Classification, 2001, Springer London.

[12] Aaron C. Courville et al. Improved Training of Wasserstein GANs, 2017, NIPS.

[13] Wenqi Wei et al. Demystifying Membership Inference Attacks in Machine Learning as a Service, 2019, IEEE Transactions on Services Computing.

[14] Peter Szolovits et al. MIMIC-III, a freely accessible critical care database, 2016, Scientific Data.

[15] Bernhard Schölkopf et al. Nonlinear causal discovery with additive noise models, 2008, NIPS.

[16] Léon Bottou et al. Wasserstein Generative Adversarial Networks, 2017, ICML.