The Hidden Language of Diffusion Models

Text-to-image diffusion models have demonstrated an unparalleled ability to generate high-quality, diverse images from a textual concept (e.g., "a doctor", "love"). However, the internal process of mapping text to a rich visual representation remains an enigma. In this work, we tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts, such as "a president" or "a composer", are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness", combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation. Our code will be available at: https://hila-chefer.github.io/Conceptor/
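
To make the phrase "sparse weighted combination of tokens from the model's vocabulary" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy five-word vocabulary with hand-picked 3-dimensional embeddings and fixed coefficients; in the method described above the coefficients are learned with an image-reconstruction objective, which this sketch does not attempt. The function name `sparse_pseudo_token` and the parameter `k` are hypothetical.

```python
def sparse_pseudo_token(embeddings, coefficients, k):
    """Combine the k highest-weighted vocabulary embeddings into one vector.

    Assumption: sparsity is enforced by keeping only the top-k coefficients.
    """
    # Indices of the k largest coefficients; all other tokens are dropped.
    top = sorted(range(len(coefficients)),
                 key=lambda i: coefficients[i], reverse=True)[:k]
    dim = len(embeddings[0])
    pseudo = [0.0] * dim
    # Weighted sum of the retained token embeddings.
    for i in top:
        for j in range(dim):
            pseudo[j] += coefficients[i] * embeddings[i][j]
    return pseudo, top


# Toy data (illustrative only, not taken from any real model).
vocab = ["obama", "biden", "suit", "flag", "banana"]
embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
]
coefficients = [0.5, 0.3, 0.1, 0.05, 0.0]

pseudo, top = sparse_pseudo_token(embeddings, coefficients, k=2)
print([vocab[i] for i in top], pseudo)
```

With `k=2`, only the two highest-weighted tokens contribute, so the resulting pseudo-token can be read off directly as a short list of vocabulary words and their weights.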
