Generative Adversarial Networks for Synthetic Data Generation: A Comparative Study

Generative Adversarial Networks (GANs) are attracting increasing attention as a means of synthesising data. So far, much of this work has addressed use cases outside the data confidentiality domain, a common application being the production of artificial images. Here we consider the potential of GANs for generating synthetic census microdata. We employ a battery of utility metrics and a disclosure risk metric (the Targeted Correct Attribution Probability) to compare the data produced by tabular GANs with those produced by orthodox data synthesis methods.
