Generating synthetic data in finance: opportunities, challenges and pitfalls

Financial services generate a huge volume of data that is extremely complex and varied. These datasets are often stored in silos within organisations for various reasons, including, but not limited to, regulatory requirements and business needs. As a result, data sharing between lines of business, as well as outside the organisation (e.g. with the research community), is severely limited. It is therefore critical to investigate methods for synthesising financial datasets that preserve the properties of the real data while respecting the privacy of the parties involved in a particular dataset.

This introductory paper aims to highlight the growing need for effective synthetic data generation in the financial domain. We identify three main areas of focus for the academic community: 1) generating realistic synthetic datasets; 2) measuring the similarities between real and generated datasets; and 3) ensuring the generative process satisfies any privacy constraints.

Although these challenges are also present in other domains, the extra regulatory and privacy requirements add another dimension of complexity and offer a unique opportunity to study the topic in financial services. Finally, we aim to develop a shared vocabulary and context for generating synthetic financial data, using two types of financial datasets as examples.
