Data Analysis and Synthesis of COVID-19 Patients using Deep Generative Models: A Case Study of Jakarta, Indonesia

Two years have passed since COVID-19 broke out in Indonesia, where the central and regional governments have used vast amounts of data on COVID-19 patients for policymaking. However, using individual patient data can raise privacy problems. It is therefore crucial to keep COVID-19 data private by using synthetic data publishing (SDP). One well-known SDP approach uses deep generative models. This study explores the use of deep generative models to synthesise individual COVID-19 data. The deep generative models used in this paper are Generative Adversarial Networks (GAN), Adversarial Autoencoders (AAE), and Adversarial Variational Bayes (AVB). The study found that AAE and AVB outperform GAN in loss, distribution, and privacy preservation, particularly when using the Wasserstein approach. Furthermore, predictions on the real dataset made with the synthetic data achieved sensitivity and F1 scores above 0.8. However, the synthetic data still has drawbacks and biases, especially when used to fit statistical models. It is therefore essential to improve deep generative models, especially in maintaining the statistical guarantees of the dataset.
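
For readers unfamiliar with the two utility metrics reported above, the following is a minimal sketch of how sensitivity and the F1 score can be computed from binary predictions. The abstract does not state which outcome is predicted or which classifier is used, so the function name `sensitivity_and_f1` and the label lists are purely illustrative assumptions, not the study's data or method.

```python
def sensitivity_and_f1(y_true, y_pred):
    """Return (sensitivity, F1) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity (recall): share of real positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: share of predicted positives that were correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1: harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return sensitivity, f1


# Hypothetical example: labels from a real hold-out set and predictions
# from a model assumed to have been trained on synthetic records.
real_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

sens, f1 = sensitivity_and_f1(real_labels, predictions)
print(f"sensitivity={sens:.2f}, F1={f1:.2f}")
```

In a train-on-synthetic, test-on-real setting, `predictions` would come from a model fitted on the generated records and `real_labels` from the original dataset; values close to those obtained when training on real data suggest the synthetic data preserves predictive utility.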
