Imagining an Engineer: On GAN-Based Data Augmentation Perpetuating Biases

The use of synthetic data generated by Generative Adversarial Networks (GANs) has become a popular method of data augmentation in many applications. While practitioners celebrate it as an economical way to obtain more data for training downstream classifiers, it is not clear that they recognize the pitfalls inherent in the technique. In this paper, we caution practitioners against drawing a false sense of security about data biases from data augmentation. To drive this point home, we show that, starting from a dataset of head-shots of engineering researchers, GAN-based augmentation "imagines" synthetic engineers, most of whom have masculine features and white skin color (as inferred from a human subject study conducted on Amazon Mechanical Turk). This demonstrates how biases inherent in the training data are reinforced, and sometimes even amplified, by GAN-based data augmentation; it should serve as a cautionary tale for lay practitioners.
