Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model
With the emergence of large pre-trained vision-language models such as CLIP, transferable representations can be adapted to a wide range of downstream tasks via prompt tuning. Prompt tuning probes the information beneficial to downstream tasks from the general knowledge stored in both the image and text encoders of the pre-trained vision-language model. A recently proposed method named Context Optimization (CoOp) introduces a set of learnable vectors as a text prompt on the language side, but tuning the text prompt alone cannot affect the visual features computed by the image encoder, leading to sub-optimal performance. In this paper, we propose a dual-modality prompt tuning paradigm that learns text prompts and visual prompts for the text and image encoders simultaneously. In addition, to make the visual prompt concentrate more on the target visual concept, we propose Class-Aware Visual Prompt Tuning (CAVPT), in which the class-aware visual prompt is generated dynamically by performing cross attention between the language descriptions of template prompts and the visual class token embeddings. Our method provides a new paradigm for tuning large pre-trained vision-language models, and extensive experimental results on 8 datasets demonstrate its effectiveness. Our code is available in the supplementary materials.
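The cross-attention step described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the module name `ClassAwarePrompt`, the choice of the visual class token as the attention query, and all tensor dimensions are assumptions for the sake of the example.

```python
import torch
import torch.nn as nn

class ClassAwarePrompt(nn.Module):
    """Hypothetical sketch: generate a class-aware visual prompt by
    cross-attending between text prompt embeddings and a visual class token."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, text_prompts: torch.Tensor, class_token: torch.Tensor) -> torch.Tensor:
        # class_token: (B, 1, dim) acts as the query;
        # text_prompts: (B, T, dim) act as keys and values,
        # so the generated prompt is conditioned on the language descriptions.
        prompt, _ = self.attn(class_token, text_prompts, text_prompts)
        return prompt  # (B, 1, dim): a dynamically generated visual prompt

# Usage with assumed CLIP-like embedding width of 512:
gen = ClassAwarePrompt(dim=512)
text_prompts = torch.randn(2, 16, 512)   # 16 text prompt token embeddings
class_token = torch.randn(2, 1, 512)     # visual class token embedding
visual_prompt = gen(text_prompts, class_token)
```

In an actual pipeline, the resulting prompt would be concatenated to the image encoder's input token sequence so that the visual features shift toward the target class concept.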