Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion

In this work, we investigate the problem of Model-Agnostic Zero-Shot Classification (MA-ZSC), which refers to training non-specific classification architectures (downstream models) to classify real images without using any real images during training. Recent research has demonstrated that generating synthetic training images using diffusion models provides a potential solution to address MA-ZSC. However, the performance of this approach currently falls short of that achieved by large-scale vision-language models. One possible explanation is a potential significant domain gap between synthetic and real images. Our work offers a fresh perspective on the problem by providing initial insights that MA-ZSC performance can be improved by improving the diversity of images in the generated dataset. We propose a set of modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity, which we refer to as our $\textbf{bag of tricks}$. Our approach shows notable improvements in various classification architectures, with results comparable to state-of-the-art models such as CLIP. To validate our approach, we conduct experiments on CIFAR10, CIFAR100, and EuroSAT, which is particularly difficult for zero-shot classification due to its satellite image domain. We evaluate our approach with five classification architectures, including ResNet and ViT. Our findings provide initial insights into the problem of MA-ZSC using diffusion models. All code will be available on GitHub.

[1]  Chongxuan Li,et al.  Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels , 2023, ArXiv.

[2]  R. Salakhutdinov,et al.  Effective Data Augmentation With Diffusion Models , 2023, ArXiv.

[3]  Zachary Chase Lipton,et al.  CHiLS: Zero-Shot Image Classification with Hierarchical Label Sets , 2023, ArXiv.

[4]  Aditya Grover,et al.  Leaving Reality to Imagination: Robust Classification via Generated Datasets , 2023, ArXiv.

[5]  Philip H. S. Torr,et al.  Not Just Pretty Pictures: Text-to-Image Generators Enable Interpretable Interventions for Robust Representations , 2022, ArXiv.

[6]  Ankush Gupta,et al.  SuS-X: Training-Free Name-Only Transfer of Vision-Language Models , 2022, ArXiv.

[7]  Ludwig Schmidt,et al.  LAION-5B: An open large-scale dataset for training next generation image-text models , 2022, NeurIPS.

[8]  Philip H. S. Torr,et al.  Is synthetic data from generative models ready for image recognition? , 2022, ICLR.

[9]  Carl Vondrick,et al.  Visual Classification via Description from Large Language Models , 2022, ICLR.

[10]  Ali Farhadi,et al.  What does a platypus look like? Generating customized prompts for zero-shot image classification , 2022, ArXiv.

[11]  Jonathan Ho Classifier-Free Diffusion Guidance , 2022, ArXiv.

[12]  David J. Fleet,et al.  Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding , 2022, NeurIPS.

[13]  Prafulla Dhariwal,et al.  Hierarchical Text-Conditional Image Generation with CLIP Latents , 2022, ArXiv.

[14]  Trevor Darrell,et al.  A ConvNet for the 2020s , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Prafulla Dhariwal,et al.  GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models , 2021, ICML.

[16]  B. Ommer,et al.  High-Resolution Image Synthesis with Latent Diffusion Models , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[18]  Alec Radford,et al.  Zero-Shot Text-to-Image Generation , 2021, ICML.

[19]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[20]  Jiaming Song,et al.  Denoising Diffusion Implicit Models , 2020, ICLR.

[21]  Mark Chen,et al.  Language Models are Few-Shot Learners , 2020, NeurIPS.

[22]  Matthieu Cord,et al.  This Dataset Does Not Exist: Training Models from Generated Images , 2019, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[23]  K. Keutzer,et al.  Domain Randomization and Pyramid Consistency: Simulation-to-Real Generalization Without Accessing Target Domain Data , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[24]  Taghi M. Khoshgoftaar,et al.  A survey on Image Data Augmentation for Deep Learning , 2019, Journal of Big Data.

[25]  Quoc V. Le,et al.  Searching for MobileNetV3 , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[26]  Begüm Demir,et al.  Bigearthnet: A Large-Scale Benchmark Archive for Remote Sensing Image Understanding , 2019, IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing Symposium.

[27]  Chunyan Miao,et al.  A Survey of Zero-Shot Learning , 2019, ACM Trans. Intell. Syst. Technol..

[28]  Frank Hutter,et al.  Decoupled Weight Decay Regularization , 2017, ICLR.

[29]  Andreas Dengel,et al.  EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification , 2017, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

[30]  Ilya Sutskever,et al.  Language Models are Unsupervised Multitask Learners , 2019 .

[31]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[32]  Sepp Hochreiter,et al.  GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium , 2017, NIPS.

[33]  Wojciech Zaremba,et al.  Domain randomization for transferring deep neural networks from simulation to the real world , 2017, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[34]  Bernt Schiele,et al.  Zero-Shot Learning — The Good, the Bad and the Ugly , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Pietro Perona,et al.  Microsoft COCO: Common Objects in Context , 2014, ECCV.

[37]  Christoph H. Lampert,et al.  Attribute-Based Classification for Zero-Shot Visual Object Categorization , 2014, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  James Hays,et al.  SUN attribute database: Discovering, annotating, and recognizing scene attributes , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[41]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .