Ada-DQA: Adaptive Diverse Quality-aware Feature Acquisition for Video Quality Assessment

Video quality assessment (VQA) has attracted growing attention in recent years, yet the great expense of annotating large-scale VQA datasets remains the main obstacle for current deep-learning methods. To overcome this shortage of training data, in this paper we first consider the full range of video distribution diversity (i.e., content, distortion, motion) and employ diverse pretrained models (differing in, e.g., architecture, pretext task, and pre-training dataset) to benefit the quality representation. We propose an Adaptive Diverse Quality-aware feature Acquisition (Ada-DQA) framework to capture the desired quality-related features produced by these frozen pretrained models. Through its Quality-aware Acquisition Module (QAM), the framework extracts the features most essential and relevant to quality. Finally, the learned quality representation serves as supplementary supervision, alongside the labeled quality score, to guide the training of a relatively lightweight VQA model in a knowledge distillation manner, which greatly reduces computational cost at inference. Experimental results on three mainstream no-reference VQA benchmarks show that Ada-DQA clearly outperforms current state-of-the-art approaches without using any extra VQA training data.
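To make the described pipeline concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: a quality-aware acquisition module that adaptively fuses features from diverse frozen pretrained models, and a distillation step that trains a lightweight student under both the labeled quality score and the learned representation. All names and design choices here (`QualityAwareAcquisition`, `distillation_step`, the softmax gating, the MSE losses, and the weight `alpha`) are illustrative assumptions, not the paper's actual QAM design or loss formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QualityAwareAcquisition(nn.Module):
    """Hypothetical sketch of a QAM: adaptively weight features from
    several frozen pretrained models into one quality representation."""

    def __init__(self, feat_dims, out_dim):
        super().__init__()
        # Project each pretrained model's features into a shared space.
        self.projs = nn.ModuleList([nn.Linear(d, out_dim) for d in feat_dims])
        # Learn per-model importance weights (the "adaptive" acquisition).
        self.gate = nn.Linear(out_dim * len(feat_dims), len(feat_dims))

    def forward(self, feats):
        # feats: list of (B, d_i) tensors, one per frozen pretrained model
        projected = [p(f) for p, f in zip(self.projs, feats)]   # M x (B, D)
        stacked = torch.stack(projected, dim=1)                 # (B, M, D)
        weights = self.gate(torch.cat(projected, dim=1))        # (B, M)
        weights = weights.softmax(dim=-1)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)     # (B, D)


def distillation_step(video, mos, frozen_models, qam, student, head, alpha=1.0):
    """One training step for the lightweight student: regress the labeled
    quality score (MOS) while mimicking the QAM's quality representation."""
    with torch.no_grad():  # the diverse pretrained models stay frozen
        feats = [m(video) for m in frozen_models]
    teacher_repr = qam(feats)                     # fused quality representation
    student_repr = student(video)                 # lightweight backbone, (B, D)
    pred = head(student_repr).squeeze(-1)         # predicted quality score
    loss_label = F.mse_loss(pred, mos)            # supervision from labels
    loss_kd = F.mse_loss(student_repr, teacher_repr)  # representation distillation
    return loss_label + alpha * loss_kd
```

In this sketch the per-model softmax weights play the role of "acquiring" the most quality-relevant features from the diverse frozen backbones; at inference only the student and its regression head are needed, which is what reduces the computational cost.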
