Exploring CLIP for Assessing the Look and Feel of Images

Measuring the perception of visual content is a long-standing problem in computer vision. Many mathematical models have been developed to evaluate the look or quality of an image. Despite the effectiveness of such tools in quantifying degradations such as noise and blurriness levels, such quantification is only loosely coupled with human language. When it comes to more abstract perception about the feel of visual content, existing methods can only rely on supervised models that are explicitly trained with labeled data collected via laborious user studies. In this paper, we go beyond the conventional paradigms by exploring the rich visual language prior encapsulated in Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner. In particular, we discuss effective prompt designs and present an effective prompt pairing strategy to harness the prior. We also provide extensive experiments on controlled datasets and Image Quality Assessment (IQA) benchmarks. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments. Code will be available at https://github.com/IceClear/CLIP-IQA.

Abstract perception. Abstract perception is a subjective feeling and is not quantifiable. Existing methods for abstract perception assessment mainly focus on aesthetic image assessment [18,26,35,48,50,63] and image emotion analysis [1,5,13,25,37,61,64]. The majority of these methods require human annotations for training, which are laborious to obtain. Therefore, in this work, we explore the possibility of exploiting the vision-language priors captured in CLIP to bypass the quality labeling process. Our results show that our CLIP-IQA is able to perceive abstract aspects of an image, even without training with annotations.

CLIP-based approaches. Benefiting from large-scale visual-language pre-training, CLIP has shown impressive capability and generalizability on a wide range of tasks, such as image manipulation, image captioning, view synthesis [49], and semantic segmentation. These applications mainly focus on building the semantic relationship between images and texts, and hence they suffer less from linguistic ambiguity. Different from these works, we focus on the effectiveness of CLIP in understanding both the quality and abstract perceptions of an image.

Specifically, our CLIP-IQA with paired prompts and the removal of positional embedding achieves performance comparable to learning-based methods, which require careful designs in network architecture and extensive task-specific training. Our findings provide a solid ground for CLIP-based quality assessment and, more generally, perception assessment. We believe our studies and discussion could motivate future development in various directions, such as more sophisticated prompts, better generalizability, and effective adoption of the CLIP prior.
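The prompt pairing strategy described above can be summarized with a short worked equation (our reconstruction from the description on this page; the notation is ours, not necessarily the paper's). Let x be the CLIP image feature and t_1, t_2 the text features of an antonym prompt pair, e.g., "Good photo." and "Bad photo."; the score is the softmax over the two cosine similarities:

\[
s_i = \frac{x \cdot t_i}{\|x\|\,\|t_i\|}, \quad i \in \{1, 2\}, \qquad
\bar{s} = \frac{e^{s_1}}{e^{s_1} + e^{s_2}},
\]

so that \(\bar{s} \in (0, 1)\) measures how much closer the image lies to the positive prompt than to the negative one. Because only the relative similarity between the two prompts matters, the pairing reduces the linguistic ambiguity of scoring against a single prompt.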
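Below is a minimal sketch of this paired-prompt scoring in PyTorch, assuming the open-source openai/CLIP package; the backbone name, prompt pair, and image path are illustrative placeholders. Note that CLIP-IQA as described also removes the positional embedding of the image encoder, which this sketch omits for brevity.

```python
# Paired-prompt scoring sketch (assumes: pip install git+https://github.com/openai/CLIP.git)
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # backbone choice is illustrative

# Antonym prompt pair: the final score is a softmax over the two similarities.
prompts = clip.tokenize(["Good photo.", "Bad photo."]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(prompts)
    # L2-normalize so the dot product equals cosine similarity.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Raw cosine similarity to each prompt; CLIP's learned temperature
    # (model.logit_scale) could optionally sharpen these before the softmax.
    logits = image_feat @ text_feat.t()
    score = logits.softmax(dim=-1)[0, 0].item()  # probability mass on "Good photo."

print(f"Predicted quality score: {score:.3f}")
```

Swapping the prompt pair (e.g., "Happy photo." / "Sad photo.") turns the same pipeline into an assessor of abstract feel rather than quality, which is the sense in which the prior generalizes in a zero-shot manner.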

References

[1] S. Kwong et al. VCRNet: Visual Compensation Restoration Network for No-Reference Image Quality Assessment. IEEE Transactions on Image Processing, 2022.

[2] Munawar Hayat et al. ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via Exploiting CLIP Cues. CVPR, 2022.

[3] Xin Jin et al. Pseudo-labelling and Meta Reweighting Learning for Image Aesthetic Quality Assessment. arXiv, 2022.

[4] Lu Yuan et al. RegionCLIP: Region-based Language-Image Pretraining. CVPR, 2022.

[5] Jiwen Lu et al. DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting. CVPR, 2022.

[6] L. Gool et al. Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model. CVPR, 2022.

[7] Chen Change Loy et al. Learning to Prompt for Vision-Language Models. International Journal of Computer Vision, 2021.

[8] Yin Cui et al. Open-vocabulary Object Detection via Vision and Language Knowledge Distillation. ICLR, 2021.

[9] Sukhdev Singh et al. Natural language processing: state of the art, current trends and challenges. Multimedia Tools and Applications, 2017.

[10] Ron Mokady et al. ClipCap: CLIP Prefix for Image Captioning. arXiv, 2021.

[11] Peng Gao et al. Tip-Adapter: Training-free CLIP-Adapter for Better Vision-Language Modeling. arXiv, 2021.

[12] Peyman Milanfar et al. MUSIQ: Multi-scale Image Quality Transformer. ICCV, 2021.

[13] Yedid Hoshen et al. An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild. NeurIPS, 2021.

[14] Xinbo Gao et al. Learning the Non-differentiable Optimization for Blind Super-Resolution. CVPR, 2021.

[15] Manri Cheon et al. Perceptual Image Quality Assessment with Transformers. CVPR Workshops, 2021.

[16] Ronan Le Bras et al. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. EMNLP, 2021.

[17] Pieter Abbeel et al. Putting NeRF on a Diet: Semantically Consistent Few-Shot View Synthesis. ICCV, 2021.

[18] Daniel Cohen-Or et al. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. ICCV, 2021.

[19] Ilya Sutskever et al. Learning Transferable Visual Models From Natural Language Supervision. ICML, 2021.

[20] Shiqi Wang et al. Comparison of Full-Reference Image Quality Models for Optimization of Image Processing Systems. International Journal of Computer Vision, 2021.

[21] Maks Ovsjanikov et al. ArtEmis: Affective Language for Visual Art. CVPR, 2021.

[22] Rui Xu et al. Positional Encoding as Spatial Inductive Bias in GANs. CVPR, 2021.

[23] Yu-Kun Lai et al. APSE: Attention-Aware Polarity-Sensitive Embedding for Emotion-Based Image Retrieval. IEEE Transactions on Multimedia, 2020.

[24] Bo Dai et al. DenseCLIP: Extract Free Dense Labels from CLIP. arXiv, 2021.

[25] Lingqiao Liu et al. Semi-supervised Adversarial Learning for Attribute-Aware Photo Aesthetic Assessment. IEEE Transactions on Multimedia, 2021.

[26] Jong Chul Ye et al. DiffusionCLIP: Text-guided Image Manipulation Using Diffusion Models. arXiv, 2021.

[27] Sunghyun Cho et al. Real-World Blur Dataset for Learning and Benchmarking Deblurring Algorithms. ECCV, 2020.

[28] Haoyu Chen et al. PIPAL: A Large-Scale Image Quality Assessment Dataset for Perceptual Image Restoration. ECCV, 2020.

[29] Yu Zhu et al. Blindly Assess Image Quality in the Wild Guided by a Self-Adaptive Hyper Network. CVPR, 2020.

[30] Kede Ma et al. Perceptual Quality Assessment of Smartphone Photography. CVPR, 2020.

[31] Seon Joo Kim et al. Investigating Loss Functions for Extreme Super-Resolution. CVPR Workshops, 2020.

[32] Guangming Shi et al. MetaIQA: Deep Meta-Learning for No-Reference Image Quality Assessment. CVPR, 2020.

[33] Dietmar Saupe et al. KonIQ-10k: An Ecologically Valid Database for Deep Learning of Blind Image Quality Assessment. IEEE Transactions on Image Processing, 2019.

[34] Zhou Wang et al. Blind Image Quality Assessment Using a Deep Bilinear Convolutional Neural Network. IEEE Transactions on Circuits and Systems for Video Technology, 2019.

[35] Lei Zhang et al. A Unified Probabilistic Formulation of Image Aesthetic Assessment. IEEE Transactions on Image Processing, 2020.

[36] Yu Qiao et al. RankSRGAN: Generative Adversarial Networks With Ranker for Image Super-Resolution. ICCV, 2019.

[37] Feiyue Huang et al. Attention-based Multi-Patch Aggregation for Image Aesthetic Assessment. ACM Multimedia, 2018.

[38] Chen Wei et al. Deep Retinex Decomposition for Low-Light Enhancement. BMVC, 2018.

[39] Amit K. Roy-Chowdhury et al. Contemplating Visual Emotions: Understanding and Overcoming Dataset Bias. ECCV, 2018.

[40] Hong Cai et al. PieAPP: Perceptual Image-Error Assessment Through Pairwise Preference. CVPR, 2018.

[41] David Zhang et al. Real-world Noisy Image Denoising: A New Benchmark. arXiv, 2018.

[42] Zhengfang Duanmu et al. End-to-End Blind Image Quality Assessment Using Deep Neural Networks. IEEE Transactions on Image Processing, 2018.

[43] Alexei A. Efros et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR, 2018.

[44] Yochai Blau et al. The Perception-Distortion Tradeoff. CVPR, 2017.

[45] In-Kwon Lee et al. Building Emotional Machines: Recognizing Image Emotions Through Deep Neural Networks. IEEE Transactions on Multimedia, 2017.

[46] Sebastian Bosse et al. Deep Neural Networks for No-Reference and Full-Reference Image Quality Assessment. IEEE Transactions on Image Processing, 2016.

[47] Chih-Yuan Yang et al. Learning a No-Reference Quality Metric for Single-Image Super-Resolution. Computer Vision and Image Understanding, 2016.

[48] Radomír Mech et al. Photo Aesthetics Ranking Network with Attributes and Content Adaptation. ECCV, 2016.

[49] Alan C. Bovik et al. Massive Online Crowdsourced Study of Subjective and Objective Picture Quality. IEEE Transactions on Image Processing, 2015.

[50] Lei Zhang et al. A Feature-Enriched Completely Blind Image Quality Evaluator. IEEE Transactions on Image Processing, 2015.

[51] H. Qi et al. Image color transfer to evoke different emotions based on color combinations. Signal, Image and Video Processing, 2013.

[52] Nikolay N. Ponomarenko et al. Image database TID2013: Peculiarities, results and perspectives. Signal Processing: Image Communication, 2015.

[53] Hongyu Li et al. VSI: A Visual Saliency-Induced Index for Perceptual Image Quality Assessment. IEEE Transactions on Image Processing, 2014.

[54] Yi Li et al. Convolutional Neural Networks for No-Reference Image Quality Assessment. CVPR, 2014.

[55] Lei Zhang et al. Gradient Magnitude Similarity Deviation: A Highly Efficient Perceptual Image Quality Index. IEEE Transactions on Image Processing, 2013.

[56] Erkki Oja et al. Affective Abstract Image Classification and Retrieval Using Multiple Kernel Learning. ICONIP, 2013.

[57] Alan C. Bovik et al. Making a "Completely Blind" Image Quality Analyzer. IEEE Signal Processing Letters, 2013.

[58] Alan C. Bovik et al. No-Reference Image Quality Assessment in the Spatial Domain. IEEE Transactions on Image Processing, 2012.

[59] Christophe Charrier et al. Blind Image Quality Assessment: A Natural Scene Statistics Approach in the DCT Domain. IEEE Transactions on Image Processing, 2012.

[60] David S. Doermann et al. Unsupervised feature learning framework for no-reference image quality assessment. CVPR, 2012.

[61] Naila Murray et al. AVA: A large-scale database for aesthetic visual analysis. CVPR, 2012.

[62] Zhou Wang et al. Applications of Objective Image Quality Assessment Methods [Applications Corner]. IEEE Signal Processing Magazine, 2011.

[63] David Zhang et al. FSIM: A Feature Similarity Index for Image Quality Assessment. IEEE Transactions on Image Processing, 2011.

[64] Sylvain Paris et al. Learning photographic global tonal adjustment with a database of input/output image pairs. CVPR, 2011.

[66] Alan C. Bovik et al. A Two-Step Framework for Constructing Blind Image Quality Indices. IEEE Signal Processing Letters, 2010.

[67] Eric C. Larson et al. Most apparent distortion: full-reference image quality assessment and the role of strategy. Journal of Electronic Imaging, 2010.

[68] Alan C. Bovik et al. A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms. IEEE Transactions on Image Processing, 2006.

[69] Tamás Szirányi et al. Artifact reduction with diffusion preprocessing for image compression, 2005.

[70] Peter J. Lang et al. Gaze Patterns When Looking at Emotional Pictures: Motivationally Biased Attention, 2004.

[71] Alan C. Bovik et al. Image information and visual quality. ICASSP, 2004.

[72] Eero P. Simoncelli et al. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 2004.

[73] Zhou Wang et al. Multiscale structural similarity for image quality assessment. Asilomar Conference on Signals, Systems and Computers, 2003.

[74] D. Ruderman. The statistics of natural images, 1994.