Copula Flows for Synthetic Data Generation

The ability to generate high-fidelity synthetic data is crucial when real data is limited or when privacy and data-protection standards permit only restricted use of the data, e.g., in medical and financial datasets. Current state-of-the-art methods for synthetic data generation are based on generative models such as Generative Adversarial Networks (GANs). Although GANs have achieved remarkable results in synthetic data generation, they are often difficult to interpret. Furthermore, GAN-based methods can struggle with data that mixes real-valued and categorical variables. Moreover, the design of the loss function (the discriminator loss) is itself problem-specific, i.e., the generative model may not be useful for tasks it was not explicitly trained for. In this paper, we propose to use a probabilistic model as a synthetic data generator. Learning a probabilistic model of the data is equivalent to estimating the density of the data. Based on copula theory, we divide the density estimation task into two parts: estimating the univariate marginals, and estimating the multivariate copula density over the univariate marginals. We use normalising flows to learn both the copula density and the univariate marginals. We benchmark our method on both simulated and real datasets in terms of density estimation as well as the ability to generate high-fidelity synthetic data.
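The two-part split described above follows from Sklar's theorem: for continuous variables, the joint density factorises as p(x_1, ..., x_d) = c(F_1(x_1), ..., F_d(x_d)) * f_1(x_1) * ... * f_d(x_d), where F_i and f_i are the marginal CDFs and densities and c is the copula density. The sketch below illustrates this two-stage idea on toy bivariate data. It is not the method of the paper: as simplifying assumptions, it swaps the normalising flows for empirical CDFs (marginals) and a Gaussian copula (dependence), and all function names and the toy data are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal, used for the Gaussian copula


def pseudo_observations(column):
    """Stage 1: push a column through its empirical CDF, giving ranks in (0, 1)."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    u = [0.0] * n
    for rank, i in enumerate(order, start=1):
        u[i] = rank / (n + 1)  # dividing by n + 1 keeps values strictly inside (0, 1)
    return u


def fit_gaussian_copula(x, y):
    """Stage 2: estimate the copula correlation from the normal scores."""
    zx = [STD_NORMAL.inv_cdf(u) for u in pseudo_observations(x)]
    zy = [STD_NORMAL.inv_cdf(u) for u in pseudo_observations(y)]
    mx, my = mean(zx), mean(zy)
    cov = sum((a - mx) * (b - my) for a, b in zip(zx, zy))
    var_x = sum((a - mx) ** 2 for a in zx)
    var_y = sum((b - my) ** 2 for b in zy)
    return cov / math.sqrt(var_x * var_y)


def empirical_quantile(sorted_column, u):
    """Inverse empirical CDF: return the order statistic at level u."""
    idx = min(int(u * len(sorted_column)), len(sorted_column) - 1)
    return sorted_column[idx]


def sample_synthetic(x, y, rho, n_samples, rng):
    """Draw from the copula, then map back through the inverse marginals."""
    sorted_x, sorted_y = sorted(x), sorted(y)
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(sorted_x, u1),
                        empirical_quantile(sorted_y, u2)))
    return samples


# Toy data (hypothetical): a skewed, strictly positive column and a correlated one.
rng = random.Random(0)
latent = [rng.gauss(0.0, 1.0) for _ in range(2000)]
x = [math.exp(z) for z in latent]
y = [z + 0.5 * rng.gauss(0.0, 1.0) for z in latent]

rho = fit_gaussian_copula(x, y)
synthetic = sample_synthetic(x, y, rho, 1000, rng)
```

Because the marginals and the dependence structure are fitted separately, either stage can be replaced by a more flexible estimator without touching the other. A Gaussian copula captures only a single correlation parameter per pair of variables, which is one motivation for a more expressive copula model.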
