Blended Latent Diffusion

The tremendous progress in neural image generation, coupled with the emergence of seemingly omnipotent vision-language models, has finally enabled text-based interfaces for creating and editing images. Handling generic images requires a diverse underlying generative model; hence, recent works utilize diffusion models, which have been shown to surpass GANs in terms of diversity. One major drawback of diffusion models, however, is their relatively slow inference time. In this paper, we present an accelerated solution to the task of local text-driven editing of generic images, where the desired edits are confined to a user-provided mask. Our solution leverages a recent text-to-image Latent Diffusion Model (LDM), which speeds up diffusion by operating in a lower-dimensional latent space. We first convert the LDM into a local image editor by incorporating Blended Diffusion into it. Next, we propose an optimization-based solution to the LDM's inherent inability to accurately reconstruct images. Finally, we address the scenario of performing local edits using thin masks. We evaluate our method against the available baselines both qualitatively and quantitatively, and demonstrate that, in addition to being faster, our method achieves better precision than the baselines while mitigating some of their artifacts.
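To make the blending mechanism concrete, below is a minimal sketch of one denoising step, not the paper's implementation: `denoise_step` and `noise_to_level` are hypothetical stand-ins for the LDM's text-conditioned reverse-diffusion step and its forward-noising process, and the user mask is assumed to be downsampled to the latent resolution before blending.

```python
import torch
import torch.nn.functional as F

def blended_latent_step(z_t, z_source, mask_latent, t,
                        denoise_step, noise_to_level):
    """One blended denoising step in the LDM's latent space (sketch).

    z_t            -- current noisy latent, shape (B, C, h, w)
    z_source       -- clean latent of the input image (VAE encoder output)
    mask_latent    -- edit mask downsampled to (B, 1, h, w); 1 = edit region
    denoise_step   -- stand-in for one text-conditioned reverse-diffusion step
    noise_to_level -- stand-in for forward-noising z_source to a given timestep
    """
    # Denoise the whole latent one step toward the text prompt.
    z_fg = denoise_step(z_t, t)
    # Noise the source latent to the matching (next) noise level.
    z_bg = noise_to_level(z_source, t - 1)
    # Blend: generated content inside the mask, (noised) original outside.
    return mask_latent * z_fg + (1.0 - mask_latent) * z_bg

# Example wiring (shapes only; the two model callables above are hypothetical):
z_source = torch.randn(1, 4, 64, 64)   # latent of a 512x512 image
mask = torch.zeros(1, 1, 512, 512)     # pixel-space user-provided mask
mask[..., 128:384, 128:384] = 1.0
mask_latent = F.interpolate(mask, size=z_source.shape[-2:], mode="nearest")
```

Because the blending is performed in the latent space, decoding the final latent does not reproduce the unmasked background exactly; this is precisely the reconstruction inaccuracy that the optimization-based step mentioned above is designed to address.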
