LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

Thanks to the rapid development of diffusion models, image synthesis has witnessed unprecedented progress. Prior works mostly rely on pre-trained language models, but a text prompt is often too abstract to specify all the spatial properties of an image, e.g., the layout configuration of a scene, which leads to sub-optimal results in complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from previous Layout-to-Image generation (L2I) methods that only explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding, and produces a scene with perceptually harmonious object styles and contextual relations. Specifically, we instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among these disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while preserving the other regions of a synthesized image, by introducing a layout-aware latent grafting mechanism that recomposes the image's local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS), to measure how well generated images preserve rational and harmonious relations among contextual objects. Comprehensive experiments demonstrate that LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.
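As a concrete illustration of the spatial dependency parser, the following is a minimal PyTorch-style sketch, assuming the layout is given as category ids plus normalized bounding boxes. All module names, feature dimensions, and the per-object pooling below are our own illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class SpatialDependencyParser(nn.Module):
    """Hypothetical sketch: paint each (category, box) pair into an object
    region map, then fuse the objects with location-aware cross-object
    attention into a layout embedding. Shapes and modules are assumptions."""

    def __init__(self, num_classes: int, dim: int = 256, map_size: int = 16):
        super().__init__()
        self.map_size = map_size
        self.class_emb = nn.Embedding(num_classes, dim)
        self.box_proj = nn.Linear(4, dim)  # (x1, y1, x2, y2), normalized to [0, 1]
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)

    def region_maps(self, boxes: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # Mask each object's feature vector to its bounding box on an S x S grid.
        S = self.map_size
        coords = torch.linspace(0.0, 1.0, S, device=boxes.device)
        gy, gx = torch.meshgrid(coords, coords, indexing="ij")        # (S, S)
        x1, y1, x2, y2 = boxes.unbind(-1)                             # each (B, N)
        inside = ((gx >= x1[..., None, None]) & (gx <= x2[..., None, None]) &
                  (gy >= y1[..., None, None]) & (gy <= y2[..., None, None]))
        # (B, N, S, S, D): per-object regional semantics, zero outside the box
        return feats[:, :, None, None, :] * inside[..., None].float()

    def forward(self, classes: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        # classes: (B, N) category ids; boxes: (B, N, 4) normalized corners
        feats = self.class_emb(classes) + self.box_proj(boxes)        # (B, N, D)
        maps = self.region_maps(boxes, feats)                         # (B, N, S, S, D)
        tokens = maps.mean(dim=(2, 3))                                # pool each map to a token
        fused, _ = self.attn(tokens, tokens, tokens)                  # cross-object attention
        return self.out_proj(fused)                                   # (B, N, D) layout embedding
```

Pooling each region map back to a single token keeps the sketch compact; the attention output then serves as the layout embedding that conditions the diffusion backbone, and the actual method presumably retains the spatial structure of the region maps rather than collapsing it.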

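The adaptive guidance schedule and the layout-aware latent grafting mechanism can be sketched just as briefly. The monotone schedule below (strong layout guidance at noisy steps, decaying toward the end to protect texture fidelity) and the model call signature are hedged assumptions; the grafting step is written as a generic masked recomposition of latents at each denoising step.

```python
import torch


def adaptive_guidance_weight(t: int, T: int, w_max: float = 7.5, w_min: float = 1.0) -> float:
    # Assumed schedule shape: strong layout guidance at early (noisy) steps,
    # where global structure is decided, decaying toward w_min at late steps
    # to protect texture fidelity. The paper's exact schedule may differ.
    return w_min + (w_max - w_min) * (t / T)


@torch.no_grad()
def guided_noise_prediction(model, x_t, t, layout_emb, T):
    # Classifier-free-guidance-style mix of unconditional and
    # layout-conditional noise predictions (assumed model interface).
    eps_uncond = model(x_t, t, None)
    eps_layout = model(x_t, t, layout_emb)
    w = adaptive_guidance_weight(t, T)
    return eps_uncond + w * (eps_layout - eps_uncond)


@torch.no_grad()
def graft_latents(x_t_new, x_t_ref, region_mask):
    # Layout-aware latent grafting (sketch): inside the reconfigured
    # instance's region, keep the freshly denoised latents; outside it,
    # keep the latents from the reference image's denoising trajectory.
    return region_mask * x_t_new + (1.0 - region_mask) * x_t_ref
```

Applied at every reverse step, graft_latents resamples only the reconfigured instance's region while the complementary region follows the reference trajectory, which matches the instance-reconfiguration behavior described in the abstract.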