Retrieval-Augmented Diffusion Models

Generative image synthesis with diffusion models has recently achieved excellent visual quality in several tasks such as text-based or class-conditional image synthesis. Much of this success is due to a dramatic increase in the computational capacity invested in training these models. This work presents an alternative approach: inspired by the successful application of retrieval-augmented methods in natural language processing, we complement the diffusion model with an explicit memory in the form of an external database. During training, the diffusion model is conditioned on visual features retrieved via CLIP from the nearest neighbors of each training instance. By leveraging CLIP’s joint image-text embedding space, our model achieves highly competitive performance on tasks for which it has not been explicitly trained, such as class-conditional or text-to-image synthesis, and can be conditioned on both text and image embeddings. Moreover, we can apply our approach to unconditional generation, where it achieves state-of-the-art performance. Our approach incurs low computational and memory overhead and is easy to implement. We discuss its relationship to concurrent work; code and pretrained models will be released soon.
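
The retrieval step described above can be sketched in a few lines. The following is a minimal, hypothetical illustration rather than the authors' released code: it assumes a precomputed database of CLIP image embeddings, uses stand-in names (`encode_clip`, `diffusion_model`), and performs an exact cosine-similarity nearest-neighbor search. At the scale of databases such as LAION, an approximate nearest-neighbor index (e.g., ScaNN or FAISS) would typically replace the exact search.

```python
# Minimal sketch of CLIP-based neighbor retrieval for conditioning a
# diffusion model. `encode_clip` and `diffusion_model` are hypothetical
# stand-ins, not part of the paper's released code.
import numpy as np

def retrieve_neighbors(query_emb: np.ndarray,
                       database: np.ndarray,
                       k: int = 4) -> np.ndarray:
    """Return the k database embeddings closest to the query
    under cosine similarity. database has shape (N, d)."""
    # Normalize so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                   # (N,) similarity scores
    top_k = np.argsort(-sims)[:k]   # indices of the k nearest neighbors
    return database[top_k]          # (k, d) neighbor embeddings

# Hypothetical training step: condition the denoiser on the CLIP
# embeddings of the training image's nearest neighbors.
# neighbors = retrieve_neighbors(encode_clip(image), clip_database, k=4)
# loss = diffusion_model(noisy_image, t, cond=neighbors)
```

Because conditioning happens in CLIP's joint embedding space, the same trained model can, at sampling time, be conditioned on text embeddings or image embeddings interchangeably, which is what enables the zero-shot class-conditional and text-to-image results mentioned above.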
