Multi-Prompt with Depth Partitioned Cross-Modal Learning

In recent years, soft prompt learning methods have been proposed to fine-tune large-scale vision-language pre-trained models for various downstream tasks. These methods typically combine learnable textual tokens with class tokens as input for models with frozen parameters. However, they often employ a single prompt to describe class contexts, failing to capture categories' diverse attributes adequately. This study introduces the Partitioned Multi-modal Prompt (PMPO), a multi-modal prompting technique that extends the soft prompt from a single learnable prompt to multiple prompts. Our method divides the visual encoder depths and connects learnable prompts to the separated visual depths, enabling different prompts to capture the hierarchical contextual depths of visual representations. Furthermore, to maximize the advantages of multi-prompt learning, we incorporate prior information from manually designed templates and learnable multi-prompts, thus improving the generalization capabilities of our approach. We evaluate the effectiveness of our approach on three challenging tasks: new class generalization, cross-dataset evaluation, and domain generalization. For instance, our method achieves a $79.28$ harmonic mean, averaged over 11 diverse image recognition datasets ($+7.62$ compared to CoOp), demonstrating significant competitiveness compared to state-of-the-art prompting methods.

[1]  Rui Zhang,et al.  Visual-Language Prompt Tuning with Knowledge-Guided Context Optimization , 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  S. Savarese,et al.  BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models , 2023, ArXiv.

[3]  Zhirui Zhang,et al.  Improving Few-Shot Performance of Language Models via Nearest Neighbor Calibration , 2022, ArXiv.

[4]  F. Khan,et al.  MaPLe: Multi-modal Prompt Learning , 2022, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xiangchen Song,et al.  Prompt Learning with Optimal Transport for Vision-Language Models , 2022, ICLR.

[6]  Ashish V. Thapliyal,et al.  PaLI: A Jointly-Scaled Multilingual Language-Image Model , 2022, ICLR.

[7]  Li Dong,et al.  Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks , 2022, ArXiv.

[8]  Serge J. Belongie,et al.  Visual Prompt Tuning , 2022, ECCV.

[9]  Chen Change Loy,et al.  Conditional Prompt Learning for Vision-Language Models , 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[10]  S. Hoi,et al.  BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation , 2022, ICML.

[11]  Jonathan Berant,et al.  Learning To Retrieve Prompts for In-Context Learning , 2021, NAACL.

[12]  Peng Gao,et al.  PointCLIP: Point Cloud Understanding by CLIP , 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Li Dong,et al.  VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts , 2021, NeurIPS.

[14]  Chen Change Loy,et al.  Learning to Prompt for Vision-Language Models , 2021, International Journal of Computer Vision.

[15]  S. Riedel,et al.  Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity , 2021, ACL.

[16]  Peng Gao,et al.  CLIP-Adapter: Better Vision-Language Models with Feature Adapters , 2021, Int. J. Comput. Vis..

[17]  Jason Baldridge,et al.  MURAL: Multimodal, Multitask Retrieval Across Languages , 2021, ArXiv.

[18]  Junnan Li,et al.  Align before Fuse: Vision and Language Representation Learning with Momentum Distillation , 2021, NeurIPS.

[19]  Ilya Sutskever,et al.  Learning Transferable Visual Models From Natural Language Supervision , 2021, ICML.

[20]  Quoc V. Le,et al.  Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision , 2021, ICML.

[21]  S. Gelly,et al.  An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale , 2020, ICLR.

[22]  D. Song,et al.  The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization , 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV).

[23]  Dawn Song,et al.  Natural Adversarial Examples , 2019, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Frank F. Xu,et al.  How Can We Know What Language Models Know? , 2019, Transactions of the Association for Computational Linguistics.

[25]  Fan Yang,et al.  Score-CAM: Score-Weighted Visual Explanations for Convolutional Neural Networks , 2019, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[26]  Sebastian Riedel,et al.  Language Models as Knowledge Bases? , 2019, EMNLP.

[27]  Eric P. Xing,et al.  Learning Robust Global Representations by Penalizing Local Predictive Power , 2019, NeurIPS.

[28]  Benjamin Recht,et al.  Do ImageNet Classifiers Generalize to ImageNet? , 2019, ICML.

[29]  Andreas Dengel,et al.  EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification , 2017, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.

[30]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Matthieu Guillaumin,et al.  Food-101 - Mining Discriminative Components with Random Forests , 2014, ECCV.

[32]  Iasonas Kokkinos,et al.  Describing Textures in the Wild , 2013, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Jonathan Krause,et al.  3D Object Representations for Fine-Grained Categorization , 2013, 2013 IEEE International Conference on Computer Vision Workshops.

[34]  Subhransu Maji,et al.  Fine-Grained Visual Classification of Aircraft , 2013, ArXiv.

[35]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[36]  C. V. Jawahar,et al.  Cats and dogs , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[37]  Krista A. Ehinger,et al.  SUN database: Large-scale scene recognition from abbey to zoo , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[38]  Fei-Fei Li,et al.  ImageNet: A large-scale hierarchical image database , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Arthur Cayley,et al.  The Collected Mathematical Papers: On Monge's “Mémoire sur la théorie des déblais et des remblais” , 2009 .

[40]  Andrew Zisserman,et al.  Automated Flower Classification over a Large Number of Classes , 2008, 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing.

[41]  Pietro Perona,et al.  Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories , 2004, 2004 Conference on Computer Vision and Pattern Recognition Workshop.