Action recognition for sports video analysis using part-attention spatio-temporal graph convolutional network

Abstract. Action recognition contributes significantly to sports video analysis, especially to the evaluation of athletes’ training. In sports video, action information is conveyed mainly by the temporal movement of human body parts, and each part contributes differently to the action representation. To exploit this observation, we propose a part-attention spatio-temporal graph convolutional network (PSGCN) that captures the dynamic spatio-temporal information in a sports video and learns the importance of each body part so as to emphasize its contribution to action recognition. Specifically, PSGCN first divides the human body into six parts, extracts convolutional neural network (CNN) features for each part, and concatenates the global feature of the whole frame; it then uses a cross-part and cross-frame graph building module to model the graph correlation among parts from different frames. Motivated by the observation that a larger temporal variation of the same part carries more action information, we further propose a part-attention (PA) learning module that estimates the importance of each part and uses it to strengthen the graph correlation, yielding a PA graph. Finally, PSGCN applies a graph convolutional network to the learned PA spatio-temporal graph with the learned part CNN features to obtain the action representation of the given sports video. The whole network is optimized with two losses, one for PA and one for action classification. To demonstrate the superiority of PSGCN, we conduct extensive experiments comparing our model with several state-of-the-art methods on widely used action recognition datasets, with an emphasis on sports actions. The results show the advantages of the proposed PSGCN for sports video analysis.
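
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several simplifications are assumptions made here for illustration: random arrays stand in for the per-part CNN features, the global frame feature is omitted, part attention is a hand-crafted softmax over mean frame-to-frame feature change rather than a learned module trained with a PA loss, and the graph uses the common symmetric normalization with self-loops. All function names (`part_attention`, `build_pa_graph`, `gcn_layer`) are hypothetical.

```python
import numpy as np

def part_attention(feats):
    """Assumed proxy for part attention: softmax over temporal variation.

    feats has shape (T, P, D): T frames, P body parts, D feature dims.
    Returns P weights that sum to 1; parts that change more get more weight.
    """
    diffs = np.diff(feats, axis=0)                          # (T-1, P, D)
    variation = np.linalg.norm(diffs, axis=2).mean(axis=0)  # (P,)
    e = np.exp(variation - variation.max())                 # stable softmax
    return e / e.sum()

def build_pa_graph(T, P, attn):
    """Cross-part, cross-frame adjacency over T*P nodes, scaled by attention."""
    N = T * P
    A = np.zeros((N, N))
    for t in range(T):
        for p in range(P):
            i = t * P + p
            # Cross-part edges: every other part in the same frame.
            for q in range(P):
                if q != p:
                    A[i, t * P + q] = 1.0
            # Cross-frame edge: the same part in the next frame.
            if t + 1 < T:
                j = (t + 1) * P + p
                A[i, j] = A[j, i] = 1.0
    # Scale each edge by the attention of both endpoints (keeps A symmetric).
    w = np.tile(attn, T)
    A = A * np.sqrt(np.outer(w, w))
    # Add self-loops, then normalize: D^{-1/2} (A + I) D^{-1/2}.
    A = A + np.eye(N)
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, X, W):
    """One graph convolution layer: ReLU(A_hat @ X @ W)."""
    return np.maximum(A_hat @ X @ W, 0.0)

rng = np.random.default_rng(0)
T, P, D, H = 4, 6, 8, 5                   # 6 parts as in the abstract; rest arbitrary
feats = rng.normal(size=(T, P, D))        # stand-in for per-part CNN features
attn = part_attention(feats)
A_hat = build_pa_graph(T, P, attn)
W = rng.normal(size=(D, H))               # untrained weights, for shape checking only
node_out = gcn_layer(A_hat, feats.reshape(T * P, D), W)
video_repr = node_out.mean(axis=0)        # pooled action representation, shape (H,)
```

In a trained model the attention weights and `W` would be learned jointly under the two losses the abstract mentions; the sketch only shows how attention can reweight the graph before the graph convolution.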
