Compositional Text-to-Image Synthesis with Attention Map Control of Diffusion Models

Recent text-to-image (T2I) diffusion models show outstanding performance in generating high-quality images conditioned on textual prompts. However, these models often fail to semantically align the generated images with the text descriptions due to their limited compositional capabilities, leading to attribute leakage, entity leakage, and missing entities. In this paper, we propose a novel attention mask control strategy based on predicted object boxes to address these three issues. In particular, we first train a BoxNet to predict a box for each entity that possesses the attribute specified in the prompt. Then, conditioned on the predicted boxes, we apply a unique mask control to the cross- and self-attention maps. By constraining the attention region of each token in the prompt, our approach produces semantically more accurate syntheses. Moreover, the proposed method is straightforward and effective, and can be readily integrated into existing T2I generators built on cross-attention diffusion models. We compare our approach with competing methods and demonstrate that it not only faithfully conveys the semantics of the original text to the generated content, but also works reliably as a ready-to-use plugin.
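To make the core idea concrete, below is a minimal sketch of box-based cross-attention masking: each entity token's attention over image pixels is zeroed outside its predicted box and renormalized. All names here (`boxes_to_masks`, `mask_cross_attention`, `token_to_box`) are illustrative assumptions, not the paper's actual API, and the sketch omits BoxNet training and the self-attention masking component.

```python
import torch

def boxes_to_masks(boxes, h, w):
    """Rasterize normalized (x0, y0, x1, y1) boxes into binary masks of shape (n_boxes, h, w)."""
    masks = torch.zeros(len(boxes), h, w)
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        masks[i, int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return masks

def mask_cross_attention(attn, masks, token_to_box):
    """
    attn: cross-attention map, shape (batch * heads, h * w, n_tokens),
          i.e., one attention weight per (pixel, text-token) pair.
    masks: (n_boxes, h, w) binary masks from boxes_to_masks.
    token_to_box: dict mapping a prompt-token index to its entity's box index
                  (hypothetical helper for this sketch).
    Zeroes each entity token's attention outside its box, then renormalizes
    each pixel's attention distribution over tokens.
    """
    flat = masks.view(masks.shape[0], -1)             # (n_boxes, h * w)
    gated = attn.clone()
    for tok, box in token_to_box.items():
        gated[..., tok] = attn[..., tok] * flat[box]  # suppress out-of-box attention
    return gated / gated.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```

In a cross-attention U-Net, a hook like this would be applied at each attention layer and denoising step so that, e.g., the tokens for "a red apple" can only attend inside the apple's predicted box, which is how attribute and entity leakage across regions is suppressed.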
