Cache Me if You Can: Accelerating Diffusion Models through Block Caching

Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However, one of the major drawbacks of diffusion models is that the image generation process is costly: a large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps, they generally treat the underlying denoising network as a black box. In this work, we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this, we introduce block caching, in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore, we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments, we show through FID, human evaluation, and qualitative analysis that block caching makes it possible to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM).
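
The two ideas in the abstract, reusing a block's output across steps and deriving a per-block schedule from its measured change, can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the relative L1 change metric, the accumulate-until-threshold rule, and the helper names `relative_change`, `make_schedule`, and `run_block_cached`. Block outputs are plain lists of floats rather than tensors.

```python
from typing import Callable, List, Sequence, Tuple


def relative_change(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Relative L1 change between a block's outputs at two consecutive steps.

    Assumed metric for illustration; the abstract does not specify one.
    """
    num = sum(abs(c - p) for p, c in zip(prev, curr))
    den = sum(abs(c) for c in curr) or 1.0  # avoid division by zero
    return num / den


def make_schedule(changes: Sequence[float], threshold: float) -> List[bool]:
    """Turn per-step changes of one block into recompute flags.

    changes[i] is the measured change between step i and step i + 1.
    The block is recomputed once the change accumulated since the last
    recomputation exceeds `threshold` (a hypothetical rule).
    """
    schedule = [True]  # the first step always has to be computed
    accumulated = 0.0
    for delta in changes:
        accumulated += delta
        if accumulated > threshold:
            schedule.append(True)
            accumulated = 0.0
        else:
            schedule.append(False)
    return schedule


def run_block_cached(
    block: Callable[[float], float],
    inputs: Sequence[float],
    schedule: Sequence[bool],
) -> Tuple[List[float], int]:
    """Apply `block` once per step, reusing the cached output when allowed."""
    cache = None
    outputs: List[float] = []
    calls = 0
    for x, recompute in zip(inputs, schedule):
        if recompute or cache is None:
            cache = block(x)  # fresh computation, refreshes the cache
            calls += 1
        outputs.append(cache)  # cached steps reuse the last fresh output
    return outputs, calls


# Toy usage: a block whose output barely changes except at one step.
schedule = make_schedule([0.04, 0.04, 0.3, 0.01], threshold=0.1)
outputs, calls = run_block_cached(lambda x: 2 * x, [1, 2, 3, 4, 5], schedule)
```

In this toy run the block is evaluated at only two of the five steps. In a real denoising network the changes would be measured per block on sample inputs beforehand, so that blocks with small step-to-step change are cached more aggressively than blocks that change quickly.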
