Text-to-Image Diffusion Models are Zero-Shot Classifiers

The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is to use a diffusion model's ability to denoise a noised image, given a text description of a label, as a proxy for that label's likelihood. We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge and to compare them with CLIP's zero-shot abilities. Both models perform competitively with CLIP across a wide range of zero-shot image classification datasets. Additionally, they achieve state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding, while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often rely on other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision-language tasks.
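To make the scoring rule concrete, below is a minimal PyTorch sketch of how denoising error can serve as a class score: for each candidate label's text prompt, the image is forward-diffused with shared noise, the conditional model predicts that noise, and the label whose prompt yields the lowest average prediction error wins. The `model`, `embed_text`, and `alphas_cumprod` interfaces are hypothetical stand-ins for a conditional diffusion model, its text encoder, and its noise schedule; the exact timestep weighting and number of samples used in the paper may differ.

```python
import torch


@torch.no_grad()
def diffusion_zero_shot_classify(model, embed_text, image, class_prompts,
                                 alphas_cumprod, n_trials=16):
    """Classify `image` by picking the prompt under which the diffusion
    model denoises it best (lowest average noise-prediction error).

    Assumed (hypothetical) interfaces:
      model(x_t, t, cond) -> predicted noise, same shape as x_t
      embed_text(prompt)  -> conditioning tensor for a text prompt
      alphas_cumprod      -> 1-D tensor of cumulative noise-schedule products
    """
    device = image.device
    n_steps = alphas_cumprod.shape[0]

    # Draw one shared set of (timestep, noise) pairs and reuse it for every
    # class prompt, so score differences reflect the text conditioning
    # rather than sampling randomness.
    ts = torch.randint(0, n_steps, (n_trials,), device=device)
    noises = [torch.randn_like(image) for _ in range(n_trials)]

    scores = []
    for prompt in class_prompts:
        cond = embed_text(prompt)
        total = 0.0
        for t, eps in zip(ts, noises):
            a_t = alphas_cumprod[t]
            # Forward-diffuse: x_t = sqrt(a_t) * x_0 + sqrt(1 - a_t) * eps.
            x_t = a_t.sqrt() * image + (1.0 - a_t).sqrt() * eps
            eps_hat = model(x_t, t.view(1), cond)
            total += torch.mean((eps_hat - eps) ** 2).item()
        scores.append(total / n_trials)

    # The predicted class is the one with the lowest denoising error.
    return min(range(len(class_prompts)), key=scores.__getitem__)
```

Reusing the same (timestep, noise) pairs across all prompts is a variance-reduction choice that makes the per-class scores directly comparable; without it, many more samples would be needed before the score gap between the correct and incorrect labels emerges from the Monte Carlo noise.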
