Examining the Utility of Differentially Private Synthetic Data Generated using Variational Autoencoder with TensorFlow Privacy

With the emergence of artificial intelligence (AI), it is becoming increasingly important for organizations to put it to use. However, an organization that possesses a substantial amount of data may lack the technical competence to perform machine learning on it, and vice versa, so it is natural for the two kinds of organizations to collaborate to realize the value of the data. With growing concern over data privacy, regulations such as the General Data Protection Regulation (GDPR) prevent an organization from sharing data with another unless the data is processed to the point that the individuals in it are no longer identifiable. Various data anonymization methods have been proposed and developed, including ones that use neural networks, such as autoencoders (AE), variational autoencoders (VAE), and generative adversarial networks (GAN). With the addition of a differential privacy framework such as TensorFlow Privacy, privacy can be formally guaranteed, but the data still needs to remain usable after the privacy protection measures are applied. The present study integrates TensorFlow Privacy into the synthetic data generation process and evaluates the usefulness of the resulting data for practical use in industry. Since TensorFlow Privacy brings a provable privacy guarantee to the synthetic data, the study focuses on evaluating data utility. TensorFlow is widely used for machine learning in both industry and academia, and TensorFlow Privacy, also developed by Google, can be a valuable addition to a synthetic data generation pipeline. The results show that a VAE trained with TensorFlow Privacy 1) generates synthetic data with good utility in most cases, as measured by descriptive statistics and machine learning classification tasks, and 2) exposes customizable privacy parameters that work as intended with respect to the privacy-utility trade-off.
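
To make the pipeline concrete, below is a minimal sketch of how a VAE can be trained with TensorFlow Privacy's differentially private optimizer and how the resulting (epsilon, delta) guarantee can be computed. It assumes TensorFlow 2.x and the tensorflow_privacy package; the architecture, hyperparameters (latent_dim, l2_norm_clip, noise_multiplier, num_microbatches), and dataset sizes are illustrative placeholders, not the settings used in the study. The essential point is that the VAE loss is kept per-example (no reduction) so the DP optimizer can clip each microbatch's gradient before adding calibrated Gaussian noise.

```python
import tensorflow as tf
import tensorflow_privacy
from tensorflow_privacy.privacy.analysis import compute_dp_sgd_privacy


class DPVAE(tf.keras.Model):
    """Variational autoencoder for tabular data, trained with a DP optimizer."""

    def __init__(self, input_dim, latent_dim=8):
        super().__init__()
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(2 * latent_dim),  # outputs mean and log-variance
        ])
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(32, activation="relu"),
            tf.keras.layers.Dense(input_dim),  # linear output for continuous features
        ])

    def call(self, x, training=False):
        mean, logvar = tf.split(self.encoder(x), 2, axis=-1)
        eps = tf.random.normal(tf.shape(mean))
        z = mean + tf.exp(0.5 * logvar) * eps  # reparameterization trick
        return self.decoder(z), mean, logvar

    def train_step(self, data):
        with tf.GradientTape() as tape:
            recon, mean, logvar = self(data, training=True)
            # Keep losses per-example (no reduction): the DP optimizer clips
            # each microbatch gradient to l2_norm_clip before adding noise.
            rec = tf.reduce_sum(tf.square(data - recon), axis=-1)
            kl = -0.5 * tf.reduce_sum(
                1.0 + logvar - tf.square(mean) - tf.exp(logvar), axis=-1)
            per_example_loss = rec + kl
        self.optimizer.minimize(per_example_loss, self.trainable_variables, tape=tape)
        return {"loss": tf.reduce_mean(per_example_loss)}


# Hypothetical DP-SGD hyperparameters; the study's actual settings may differ.
optimizer = tensorflow_privacy.DPKerasAdamOptimizer(
    l2_norm_clip=1.0,       # per-microbatch gradient clipping bound C
    noise_multiplier=1.1,   # Gaussian noise stddev is C * noise_multiplier
    num_microbatches=32,    # must evenly divide the batch size
    learning_rate=1e-3,
)

vae = DPVAE(input_dim=30)  # 30 features is a placeholder
vae.compile(optimizer=optimizer)
# vae.fit(train_data, epochs=50, batch_size=32)

# Accounting: translate the training configuration into an (epsilon, delta) bound.
eps, _ = compute_dp_sgd_privacy.compute_dp_sgd_privacy(
    n=10_000, batch_size=32, noise_multiplier=1.1, epochs=50, delta=1e-5)
print(f"epsilon = {eps:.2f} at delta = 1e-5")

# Sampling: draw latent vectors from the prior and decode them into synthetic rows.
# synthetic = vae.decoder(tf.random.normal([1000, 8]))
```

In this sketch, raising noise_multiplier (or lowering l2_norm_clip) tightens the privacy bound, giving a smaller epsilon, at the cost of noisier gradients and therefore lower utility of the generated data, which is exactly the privacy-utility trade-off that the customizable TensorFlow Privacy parameters expose.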
