Scalable Diffusion Models with Transformers

We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops -- through increased transformer depth/width or increased number of input tokens -- consistently have lower FID. In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter.

[1]  David J. Fleet,et al.  Scalable Adaptive Computation for Iterative Generation , 2022, arXiv.org.

[2]  Alexei A. Efros,et al.  Learning to Learn with Generative Models of Neural Network Checkpoints , 2022, ArXiv.

[3]  Jonathan Ho Classifier-Free Diffusion Guidance , 2022, ArXiv.

[4]  Jing Yu Koh,et al.  Scaling Autoregressive Models for Content-Rich Text-to-Image Generation , 2022, Trans. Mach. Learn. Res..

[5]  Junyan Zhu,et al.  On Aliased Resizing and Surprising Subtleties in GAN Evaluation , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Tero Karras,et al.  Elucidating the Design Space of Diffusion-Based Generative Models , 2022, NeurIPS.

[7]  David J. Fleet,et al.  Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding , 2022, NeurIPS.

[8]  Prafulla Dhariwal,et al.  Hierarchical Text-Conditional Image Generation with CLIP Latents , 2022, ArXiv.

[9]  W. Freeman,et al.  MaskGIT: Masked Generative Image Transformer , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  Andreas Geiger,et al.  StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets , 2022, SIGGRAPH.

[11]  B. Ommer,et al.  High-Resolution Image Synthesis with Latent Diffusion Models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Prafulla Dhariwal,et al.  GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models , 2021, ICML.

[13]  Fang Wen,et al.  Vector Quantized Diffusion Model for Text-to-Image Synthesis , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[14]  Alexander Kolesnikov,et al.  Scaling Vision Transformers , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  David J. Fleet,et al.  Cascaded Diffusion Models for High Fidelity Image Generation , 2021, J. Mach. Learn. Res..

[16]  Trevor Darrell,et al.  Early Convolutions Help Transformers See Better , 2021, NeurIPS.

[17]  Jakob Uszkoreit,et al.  How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers , 2021, Trans. Mach. Learn. Res..

[18]  Sergey Levine,et al.  Offline Reinforcement Learning as One Big Sequence Modeling Problem , 2021, NeurIPS.

[19]  Pieter Abbeel,et al.  Decision Transformer: Reinforcement Learning via Sequence Modeling , 2021, NeurIPS.

[20]  Prafulla Dhariwal,et al.  Diffusion Models Beat GANs on Image Synthesis , 2021, NeurIPS.

[21]  Sander Dieleman,et al.  Generating Images with Sparse Representations , 2021, ICML.

[22]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[23]  Alec Radford,et al.  Zero-Shot Text-to-Image Generation , 2021, ICML.

[24]  Prafulla Dhariwal,et al.  Improved Denoising Diffusion Probabilistic Models , 2021, ICML.

[25]  B. Ommer,et al.  Taming Transformers for High-Resolution Image Synthesis , 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[26]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[27]  Jiaming Song,et al.  Denoising Diffusion Implicit Models , 2020, ICLR.

[28]  Mark Chen,et al.  Scaling Laws for Autoregressive Generative Modeling , 2020, ArXiv.

[29]  Mark Chen,et al.  Generative Pretraining From Pixels , 2020, ICML.

[30]  Pieter Abbeel,et al.  Denoising Diffusion Probabilistic Models , 2020, NeurIPS.

[31]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[32]  Kaiming He,et al.  Designing Network Design Spaces , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Alec Radford,et al.  Scaling Laws for Neural Language Models , 2020, ArXiv.

[34]  Yang Song,et al.  Generative Modeling by Estimating Gradients of the Data Distribution , 2019, NeurIPS.

[35]  Saining Xie,et al.  On Network Design Spaces for Visual Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[36]  Ilya Sutskever,et al.  Generating Long Sequences with Sparse Transformers , 2019, ArXiv.

[37]  Jaakko Lehtinen,et al.  Improved Precision and Recall Metric for Assessing Generative Models , 2019, NeurIPS.

[38]  Timo Aila,et al.  A Style-Based Generator Architecture for Generative Adversarial Networks , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Jeff Donahue,et al.  Large Scale GAN Training for High Fidelity Natural Image Synthesis , 2018, ICLR.

[40]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[41]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[42]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[43]  Dustin Tran,et al.  Image Transformer , 2018, ICML.

[44]  Aaron C. Courville,et al.  FiLM: Visual Reasoning with a General Conditioning Layer , 2017, AAAI.

[45]  Alec Radford,et al.  Improving Language Understanding by Generative Pre-Training , 2018 .

[46]  Oriol Vinyals,et al.  Neural Discrete Representation Learning , 2017, NIPS.

[47]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[48]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[49]  Kaiming He,et al.  Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour , 2017, ArXiv.

[50]  Diederik P. Kingma,et al.  PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications , 2017, ICLR.

[51]  Alexei A. Efros,et al.  Image-to-Image Translation with Conditional Adversarial Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52]  Kevin Gimpel,et al.  Gaussian Error Linear Units (GELUs) , 2016 .

[53]  Alex Graves,et al.  Conditional Image Generation with PixelCNN Decoders , 2016, NIPS.

[54]  Wojciech Zaremba,et al.  Improved Techniques for Training GANs , 2016, NIPS.

[55]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[56]  Surya Ganguli,et al.  Deep Unsupervised Learning using Nonequilibrium Thermodynamics , 2015, ICML.

[57]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[58]  Yoshua Bengio,et al.  Generative Adversarial Nets , 2014, NIPS.

[59]  Max Welling,et al.  Auto-Encoding Variational Bayes , 2013, ICLR.

[60]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[61]  Aapo Hyvärinen,et al.  Estimation of Non-Normalized Statistical Models by Score Matching , 2005, J. Mach. Learn. Res..