Text-to-Image Diffusion Models are Zero-Shot Classifiers

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is to use a diffusion model's ability to denoise a noised image, given a text description of a label, as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and to compare them with CLIP's zero-shot abilities. Both models perform competitively with CLIP across a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding, while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often rely on other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision-language tasks.
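To make the scoring rule concrete, below is a minimal PyTorch sketch of how denoising error can serve as a class score: for each candidate label's text prompt, the image is forward-diffused with shared noise, the conditional model predicts that noise, and the label whose prompt yields the lowest average prediction error wins. The `model`, `embed_text`, and `alphas_cumprod` interfaces are hypothetical stand-ins for a conditional diffusion model, its text encoder, and its noise schedule; the exact timestep weighting and number of samples used in the paper may differ.

```python
import torch


@torch.no_grad()
def diffusion_zero_shot_classify(model, embed_text, image, class_prompts,
                                 alphas_cumprod, n_trials=16):
    """Classify `image` by picking the prompt under which the diffusion
    model denoises it best (lowest average noise-prediction error).

    Assumed (hypothetical) interfaces:
      model(x_t, t, cond) -> predicted noise, same shape as x_t
      embed_text(prompt)  -> conditioning tensor for a text prompt
      alphas_cumprod      -> 1-D tensor of cumulative noise-schedule products
    """
    device = image.device
    n_steps = alphas_cumprod.shape[0]

    # Draw one shared set of (timestep, noise) pairs and reuse it for every
    # class prompt, so score differences reflect the text conditioning
    # rather than sampling randomness.
    ts = torch.randint(0, n_steps, (n_trials,), device=device)
    noises = [torch.randn_like(image) for _ in range(n_trials)]

    scores = []
    for prompt in class_prompts:
        cond = embed_text(prompt)
        total = 0.0
        for t, eps in zip(ts, noises):
            a_t = alphas_cumprod[t]
            # Forward-diffuse: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps.
            x_t = a_t.sqrt() * image + (1.0 - a_t).sqrt() * eps
            eps_hat = model(x_t, t.view(1), cond)
            total += torch.mean((eps_hat - eps) ** 2).item()
        scores.append(total / n_trials)

    # The predicted class is the one with the lowest denoising error.
    return min(range(len(class_prompts)), key=scores.__getitem__)
```

Reusing the same (timestep, noise) pairs across all prompts is a variance-reduction choice that makes the per-class scores directly comparable; without it, many more samples would be needed before the score gap between the correct and incorrect labels emerges from the Monte Carlo noise.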
