Paper information - Diversity is Definitely Needed: Improving Model-Agnostic Zero-shot Classification via Stable Diffusion

In this work, we investigate the problem of Model-Agnostic Zero-Shot Classification (MA-ZSC), which refers to training non-specific classification architectures (downstream models) to classify real images without using any real images during training. Recent research has demonstrated that generating synthetic training images with diffusion models is a potential solution to MA-ZSC. However, the performance of this approach currently falls short of that achieved by large-scale vision-language models. One possible explanation is a potentially significant domain gap between synthetic and real images. Our work offers a fresh perspective on the problem by providing initial insights that MA-ZSC performance can be improved by increasing the diversity of images in the generated dataset. We propose a set of modifications to the text-to-image generation process using a pre-trained diffusion model to enhance diversity, which we refer to as our "bag of tricks". Our approach shows notable improvements across various classification architectures, with results comparable to state-of-the-art models such as CLIP. To validate our approach, we conduct experiments on CIFAR10, CIFAR100, and EuroSAT; the last is particularly difficult for zero-shot classification due to its satellite image domain. We evaluate our approach with five classification architectures, including ResNet and ViT. Our findings provide initial insights into the problem of MA-ZSC using diffusion models. All code will be available on GitHub.

Kien Nguyen Thanh | C. Fookes | A. Wiliem | Jordan Shipard | Wei Xiang
