Inspecting the Geographical Representativeness of Images from Text-to-Image Models

Recent progress in generative models has resulted in models that produce images that are both realistic and relevant for most textual inputs. These models are being used to generate millions of images every day and hold the potential to drastically impact areas such as generative art, digital marketing, and data augmentation. Given their outsized impact, it is important to ensure that the generated content reflects artifacts and surroundings from across the globe, rather than over-representing certain parts of the world. In this paper, we measure the geographical representativeness of common nouns (e.g., a house) generated by the DALL·E 2 and Stable Diffusion models using a crowdsourced study comprising 540 participants across 27 countries. For deliberately underspecified inputs without country names, the generated images most closely reflect the surroundings of the United States, followed by India, and the top generations rarely reflect the surroundings of the remaining countries (average score below 3 out of 5). Specifying country names in the input increases representativeness by 1.44 points on average for DALL·E 2 and 0.75 for Stable Diffusion; however, the overall scores for many countries still remain low, highlighting the need for future models to be more geographically inclusive. Lastly, we examine the feasibility of quantifying the geographical representativeness of generated images without conducting user studies.
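To make the study design above concrete, the sketch below shows one way the two prompt conditions (underspecified vs. country-specified) and the 1-to-5 rating aggregation could be set up. This is an illustrative reconstruction, not the authors' code; the noun and country lists and all function names are hypothetical.

```python
from statistics import mean

# Illustrative inputs; the actual study spans 27 countries and a set of
# common nouns chosen by the authors.
NOUNS = ["house", "city", "flag", "wedding"]
COUNTRIES = ["United States", "India", "Japan", "Brazil"]

def build_prompt(noun: str, country: str | None = None) -> str:
    """Underspecified prompt when country is None, country-specified otherwise."""
    return f"a {noun} in {country}" if country else f"a {noun}"

def aggregate_ratings(ratings_by_country: dict[str, list[int]]) -> dict[str, float]:
    """Average the 1-5 representativeness ratings collected per country."""
    return {c: mean(r) for c, r in ratings_by_country.items()}

# Made-up ratings for the underspecified prompt "a house", for illustration only.
example = {"United States": [4, 5, 4], "India": [3, 4, 3], "Japan": [2, 2, 3]}
print(build_prompt("house"))           # "a house"
print(build_prompt("house", "India"))  # "a house in India"
print(aggregate_ratings(example))
```

For the closing question of quantifying representativeness without user studies, one plausible automated proxy is an image-text model such as CLIP: score a generated image against country-specified captions and read the resulting distribution as a crude representativeness signal. The paper only examines the feasibility of such automation; the specific metric below is our assumption, not the authors' method.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def country_affinity(image: Image.Image, noun: str, countries: list[str]) -> dict[str, float]:
    """Softmax over CLIP image-text similarities to country-specified captions."""
    texts = [f"a {noun} in {c}" for c in countries]
    inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image.squeeze(0)  # shape: (len(countries),)
    return dict(zip(countries, logits.softmax(dim=-1).tolist()))
```

A heavily US-skewed affinity distribution for underspecified prompts would echo the human-rated finding above, though such a proxy inherits CLIP's own geographic biases, which is precisely why automating the measurement is treated as an open feasibility question.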
