Group-aware Contrastive Regression for Action Quality Assessment

Assessing action quality is challenging because the differences between videos are subtle while the scores vary widely. Most existing approaches regress a quality score from a single video and therefore suffer from the large inter-video score variations. In this paper, we show that the relations among videos provide important clues for more accurate action quality assessment during both training and inference. Specifically, we reformulate action quality assessment as regressing a relative score with reference to another video that shares attributes (e.g., category and difficulty), instead of learning an unreferenced score. Following this formulation, we propose a new Contrastive Regression (CoRe) framework that learns relative scores by pair-wise comparison, which highlights the differences between videos and guides the model to learn the key hints for assessment. To further exploit the relative information between two videos, we devise a group-aware regression tree that converts conventional score regression into two easier sub-problems: coarse-to-fine classification and regression within small intervals. To demonstrate the effectiveness of CoRe, we conduct extensive experiments on three mainstream AQA datasets: AQA-7, MTL-AQA and JIGSAWS. Our approach outperforms previous methods by a large margin and establishes a new state of the art on all three benchmarks.
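
The formulation above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the equal-width intervals, the use of several exemplars averaged at inference, and the toy numbers are all assumptions made for illustration. The sketch shows only the decoding side: a classifier picks one interval (a leaf of the tree) for the score difference, a regressor places the prediction inside that interval, and the final score is the exemplar score plus the predicted difference.

```python
from typing import List, Sequence, Tuple


def make_groups(lo: float, hi: float, depth: int) -> List[Tuple[float, float]]:
    """Split [lo, hi] into 2**depth intervals, one per leaf of a binary tree.

    Equal-width intervals are an assumption made here for simplicity.
    """
    n = 2 ** depth
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]


def decode_relative_score(
    leaf_probs: Sequence[float],
    leaf_offsets: Sequence[float],
    groups: Sequence[Tuple[float, float]],
) -> float:
    """Turn per-leaf outputs into one relative score.

    leaf_probs: classification output, one probability per leaf.
    leaf_offsets: regression output in [0, 1], position inside each leaf.
    """
    # Coarse step: choose the most probable interval.
    k = max(range(len(groups)), key=lambda i: leaf_probs[i])
    left, right = groups[k]
    # Fine step: regress within the chosen small interval.
    return left + leaf_offsets[k] * (right - left)


def predict_score(
    exemplar_scores: Sequence[float], relative_scores: Sequence[float]
) -> float:
    """Average (exemplar score + predicted relative score) over exemplars."""
    estimates = [s + d for s, d in zip(exemplar_scores, relative_scores)]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Hypothetical outputs for one (input, exemplar) pair.
    groups = make_groups(-10.0, 10.0, depth=2)
    delta = decode_relative_score(
        [0.1, 0.2, 0.6, 0.1], [0.5, 0.5, 0.4, 0.5], groups
    )
    print(delta)
    # Two hypothetical exemplars with known scores and predicted differences.
    print(predict_score([80.0, 84.0], [2.0, -2.0]))
```

Splitting the range of score differences into small intervals means the regressor only has to resolve a narrow range, which is the sense in which the two sub-problems are easier than regressing the full score directly.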
