Learning to Compose Diversified Prompts for Image Emotion Classification

Contrastive Language-Image Pre-training (CLIP) represents the latest incarnation of pre-trained vision-language models. Although CLIP has recently shown its superior power on a wide range of downstream vision-language tasks such as Visual Question Answering, it is still underexplored for Image Emotion Classification (IEC). Adapting CLIP to the IEC task presents three significant challenges: a tremendous gap between the pretraining objective and IEC, prompts that are shared and therefore suboptimal, and prompts that remain invariant across all instances. In this paper, we propose a general framework that shows how CLIP can be effectively applied to IEC. We first introduce a prompt tuning method that mimics the pretraining objective of CLIP and can thus leverage the rich image and text semantics entailed in CLIP. We then automatically compose instance-specific prompts by conditioning them on the categories and image contents of instances, which diversifies the prompts and avoids the suboptimality of shared ones. Evaluations on six widely used affective datasets demonstrate that our proposed method outperforms the state-of-the-art methods by a large margin (e.g., up to a 9.29% accuracy gain on the EmotionROI dataset) on IEC tasks, while training only a small number of parameters. Our code will be made publicly available for research purposes.
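To make the described recipe concrete, the sketch below illustrates one plausible way to realize instance-conditioned prompt tuning on a frozen CLIP backbone, using PyTorch and the openai/CLIP package. It is a minimal illustration under stated assumptions, not the authors' released implementation: the class name InstancePromptClassifier, the context length n_ctx=8, the linear meta-network, and the eight example emotion categories are all illustrative choices.

import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP


class InstancePromptClassifier(nn.Module):
    """Instance-conditioned prompt tuning on a frozen CLIP backbone (sketch).

    A shared bank of learnable context vectors is shifted by a lightweight
    projection of each image feature, so every image receives its own prompt.
    Composed prompts are scored against the image with CLIP's original
    contrastive (cosine-similarity) objective.
    """

    def __init__(self, clip_model, classnames, n_ctx=8):
        super().__init__()
        self.clip = clip_model
        self.n_ctx = n_ctx
        embed_dim = clip_model.token_embedding.embedding_dim

        # Shared learnable context tokens (the tunable part of the prompt).
        self.ctx = nn.Parameter(torch.empty(n_ctx, embed_dim).normal_(std=0.02))
        # Meta-network that conditions the prompt on the image content.
        self.meta_net = nn.Linear(clip_model.visual.output_dim, embed_dim)

        # Tokenize "X X ... X {classname}." once; the X placeholders are later
        # replaced by the learned context at the embedding level.
        placeholder = " ".join(["X"] * n_ctx)
        self.tokenized = clip.tokenize([f"{placeholder} {c}." for c in classnames])
        with torch.no_grad():
            self.register_buffer("token_embeds",
                                 clip_model.token_embedding(self.tokenized))

    def encode_prompts(self, ctx):
        # ctx: (C, n_ctx, D) composed context, spliced in right after the SOS token.
        prefix = self.token_embeds[:, :1, :]                 # SOS embedding
        suffix = self.token_embeds[:, 1 + self.n_ctx:, :]    # classname, ".", EOS, padding
        x = torch.cat([prefix, ctx, suffix], dim=1) + self.clip.positional_embedding
        x = self.clip.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
        x = self.clip.ln_final(x)
        eot = self.tokenized.argmax(dim=-1)                  # position of the EOT token
        return x[torch.arange(x.shape[0]), eot] @ self.clip.text_projection

    def forward(self, images):
        img_feats = self.clip.encode_image(images)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        n_cls = self.token_embeds.shape[0]

        logits = []
        for feat in img_feats:                               # one composed prompt set per image
            ctx = (self.ctx + self.meta_net(feat)).unsqueeze(0).expand(n_cls, -1, -1)
            txt_feats = self.encode_prompts(ctx)
            txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
            logits.append(self.clip.logit_scale.exp() * feat @ txt_feats.t())
        return torch.stack(logits)                           # (B, num_classes)


# Usage: only `ctx` and `meta_net` are trained; the CLIP backbone stays frozen.
model, preprocess = clip.load("ViT-B/32", device="cpu")     # fp32 weights on CPU
for p in model.parameters():
    p.requires_grad_(False)
classnames = ["amusement", "anger", "awe", "contentment",
              "disgust", "excitement", "fear", "sadness"]
classifier = InstancePromptClassifier(model, classnames)

Training would then minimize a standard cross-entropy loss over the returned logits while the CLIP encoders stay frozen, mirroring the parameter-efficient setup described above; the cosine-similarity scoring keeps the objective aligned with CLIP's contrastive pretraining.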
