Sport Action Recognition with Siamese Spatio-Temporal CNNs: Application to Table Tennis

Human action recognition in video is a key problem in visual data interpretation. Despite intensive research, recognizing actions with low inter-class variability remains a challenge. This paper presents a new Siamese Spatio-Temporal Convolutional neural network (SSTC) for this purpose. Applied to table tennis, the model detects and recognizes 20 table tennis strokes. It was trained on a dedicated dataset, TTStroke-21, recorded in natural, markerless conditions at the Faculty of Sports of the University of Bordeaux. The model takes as input an RGB image sequence and its computed optical flow. After three spatio-temporal convolutions, the two streams are fused in a fully connected layer of the proposed siamese architecture. Our method reaches an accuracy of 91.4%, against 43.1% for our baseline.
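The architecture described above (two input streams, three spatio-temporal convolutions, late fusion in a fully connected layer) can be sketched as follows. This is an illustrative reconstruction, not the authors' exact network: layer widths, the weight-sharing choice between branches, and padding the flow input to three channels so both branches can share weights are all assumptions, as is reading "TTStroke-21" as 20 stroke classes plus one negative class.

```python
import torch
import torch.nn as nn

class SSTCSketch(nn.Module):
    """Minimal sketch of a siamese spatio-temporal CNN: two
    weight-sharing 3D-conv branches (RGB frames and optical flow)
    fused in a fully connected classification layer. Hyperparameters
    are illustrative assumptions, not the paper's values."""

    def __init__(self, n_classes=21):
        super().__init__()
        # Three spatio-temporal (3D) convolutions, shared by both
        # branches -- full weight sharing is an assumption here.
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        # Late fusion: concatenate both branch embeddings, then classify.
        self.classifier = nn.Linear(2 * 64, n_classes)

    def forward(self, rgb, flow):
        # rgb, flow: (batch, 3, frames, height, width); the flow is
        # assumed padded to 3 channels so the branches share weights.
        f_rgb = self.features(rgb).flatten(1)
        f_flow = self.features(flow).flatten(1)
        return self.classifier(torch.cat([f_rgb, f_flow], dim=1))
```

For example, feeding two clips of 8 frames at 32x32 resolution through the model yields one score per class; training with cross-entropy over the class scores would be the standard choice for this kind of late-fusion classifier.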
