3D attention mechanism for fine-grained classification of table tennis strokes using a Twin Spatio-Temporal Convolutional Neural Networks

The paper addresses the problem of recognition of actions in video with low inter-class variability such as Table Tennis strokes. Two stream, "twin" convolutional neural networks are used with 3D convolutions both on RGB data and optical flow. Actions are recognized by classification of temporal windows. We introduce 3D attention modules and examine their impact on classification efficiency. In the context of the study of sportsmen performances, a corpus of the particular actions of table tennis strokes is considered. The use of attention blocks in the network speeds up the training step and improves the classification scores up to 5% with our twin model. We visualize the impact on the obtained features and notice correlation between attention and player movements and position. Score comparison of state-of-the-art action classification method and proposed approach with attentional blocks is performed on the corpus. Proposed model with attention blocks outperforms previous model without them and our baseline.

[1]  Sergey Ioffe,et al.  Rethinking the Inception Architecture for Computer Vision , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2]  Chuang Gan,et al.  TSM: Temporal Shift Module for Efficient Video Understanding , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[3]  Jenny Benois-Pineau,et al.  Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis , 2018, 2018 International Conference on Content-Based Multimedia Indexing (CBMI).

[4]  Gang Wang,et al.  Global Context-Aware Attention LSTM Networks for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[6]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Shuicheng Yan,et al.  A2-Nets: Double Attention Networks , 2018, NeurIPS.

[8]  Dumitru Erhan,et al.  Going deeper with convolutions , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Ben Kenwright Free-Form Tetrahedron Deformation , 2015, ISVC.

[10]  Feiran Huang,et al.  Multimodal Network Embedding via Attention based Multi-view Variational Autoencoder , 2018, ICMR.

[11]  Yang Du,et al.  Interaction-Aware Spatio-Temporal Pyramid Attention Networks for Action Classification , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Qingming Huang,et al.  Channel-wise Temporal Attention Network for Video Action Recognition , 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[13]  Alejandro Alvaro Ramírez-Acosta,et al.  Forward-backward visual saliency propagation in Deep NNs vs internal attentional mechanisms , 2019, 2019 Ninth International Conference on Image Processing Theory, Tools and Applications (IPTA).

[14]  Hui Zhang,et al.  Tac-Simur: Tactic-based Simulative Visual Analytics of Table Tennis , 2020, IEEE Transactions on Visualization and Computer Graphics.

[15]  Andreas Kunz,et al.  Res3ATN - Deep 3D Residual Attention Network for Hand Gesture Recognition in Videos , 2019, 2019 International Conference on 3D Vision (3DV).

[16]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[17]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[18]  Xiao Liu,et al.  Attention Clusters: Purely Attention Based Local Feature Integration for Video Classification , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19]  Xiaogang Wang,et al.  Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Nikos Komodakis,et al.  Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer , 2016, ICLR.

[21]  Jenny Benois-Pineau,et al.  Fast Action Localization in Large-Scale Video Archives , 2016, IEEE Transactions on Circuits and Systems for Video Technology.

[22]  Andrew Zisserman,et al.  Convolutional Two-Stream Network Fusion for Video Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[23]  Mirella Lapata,et al.  Long Short-Term Memory-Networks for Machine Reading , 2016, EMNLP.

[24]  Cordelia Schmid,et al.  AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[25]  Geoffrey E. Hinton,et al.  On the importance of initialization and momentum in deep learning , 2013, ICML.

[26]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Sung Wook Baik,et al.  Action Recognition in Video Sequences using Deep Bi-Directional LSTM With CNN Features , 2018, IEEE Access.

[28]  Enhua Wu,et al.  Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[29]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[30]  Mubarak Shah,et al.  UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild , 2012, ArXiv.

[31]  Boris Mansencal,et al.  Multiple Moving Object Detection for Fast Video Content Description in Compressed Domain , 2008, EURASIP J. Adv. Signal Process..

[32]  Roman Voeikov,et al.  TTNet: Real-time temporal and spatial video analysis of table tennis , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[33]  Fabio Viola,et al.  The Kinetics Human Action Video Dataset , 2017, ArXiv.

[34]  Jianguo Hu,et al.  3D RANs: 3D Residual Attention Networks for action recognition , 2019, The Visual Computer.

[35]  Mubarak Shah,et al.  Learning a Deep Model for Human Action Recognition from Novel Viewpoints , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[36]  Cees Snoek,et al.  Dance With Flow: Two-In-One Stream Action Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Yue Zhao,et al.  FineGym: A Hierarchical Video Dataset for Fine-Grained Action Understanding , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38]  Jitendra Malik,et al.  SlowFast Networks for Video Recognition , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[39]  Sanja Fidler,et al.  3D Object Proposals Using Stereo Imagery for Accurate Object Class Detection , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40]  Ce Liu,et al.  Exploring new representations and applications for motion analysis , 2009 .

[41]  Cordelia Schmid,et al.  Long-Term Temporal Convolutions for Action Recognition , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[42]  Hsien-I Lin,et al.  Ball Tracking and Trajectory Prediction for Table-Tennis Robots , 2020, Sensors.

[43]  Thomas Serre,et al.  HMDB: A large video database for human motion recognition , 2011, 2011 International Conference on Computer Vision.

[44]  Jenny Benois-Pineau,et al.  Optimal Choice of Motion Estimation Methods for Fine-Grained Action Classification with 3D Convolutional Networks , 2019, 2019 IEEE International Conference on Image Processing (ICIP).

[45]  Abhinav Gupta,et al.  Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.