Transitioning from Real to Synthetic data: Quantifying the bias in model

With the advent of generative modelling techniques, synthetic data has penetrated various domains, from unstructured data such as images and text to structured datasets modelling healthcare outcomes, risk decisioning in the financial domain, and more. It overcomes challenges such as limited training data, class imbalance, and restricted access to datasets owing to privacy concerns. Prior work exists to quantify and mitigate fairness issues so that models used for automated decisioning make fair decisions. This study aims to establish the trade-off between bias and fairness in models trained on synthetic data. Variants of synthetic data generation techniques, including differentially private generation schemes, were studied to understand bias amplification. Through experiments on a tabular dataset, we demonstrate that models trained on synthetic data exhibit varying levels of bias impact. Techniques that generate less correlated features perform well, as evidenced by fairness metrics: relative drops of 94%, 82%, and 88% in demographic parity difference (DPD), equality of odds (EoD), and equality of opportunity (EoP), respectively, and a 24% relative improvement in demographic parity ratio (DPR), with respect to models trained on the real dataset. We believe the outcome of our study will help data science practitioners understand the bias inherent in the use of synthetic data.
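For concreteness, the four metrics reported above can all be computed from group-wise selection and error rates. The following is a minimal sketch, assuming binary labels, binary predictions, and a binary sensitive attribute; the function names and the two-group structure are our own illustration (a library such as fairlearn offers equivalent metric implementations), not the authors' code.

    import numpy as np

    def group_rates(y_true, y_pred, a):
        # Selection rate, TPR, and FPR per value of the sensitive attribute a.
        # Assumes every group contains both positive and negative examples.
        rates = {}
        for g in np.unique(a):
            m = (a == g)
            sel = y_pred[m].mean()                    # P(y_hat=1 | A=g)
            tpr = y_pred[m & (y_true == 1)].mean()    # P(y_hat=1 | A=g, y=1)
            fpr = y_pred[m & (y_true == 0)].mean()    # P(y_hat=1 | A=g, y=0)
            rates[g] = (sel, tpr, fpr)
        return rates

    def fairness_metrics(y_true, y_pred, a):
        # Unpacking assumes exactly two groups (binary sensitive attribute).
        (sel0, tpr0, fpr0), (sel1, tpr1, fpr1) = group_rates(y_true, y_pred, a).values()
        dpd = abs(sel0 - sel1)                          # demographic parity difference
        dpr = min(sel0, sel1) / max(sel0, sel1)         # demographic parity ratio
        eop = abs(tpr0 - tpr1)                          # equality of opportunity (TPR gap)
        eod = max(eop, abs(fpr0 - fpr1))                # equalized odds (worst group gap)
        return dpd, dpr, eop, eod

Under this reading, a reported relative drop such as the 94% DPD figure would correspond to (DPD_real - DPD_synthetic) / DPD_real, comparing predictions of a model trained on the real dataset against one trained on the synthetic dataset.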
