Three-Stream 3D/1D CNN for Fine-Grained Action Classification and Segmentation in Table Tennis

This paper proposes a method for fusing modalities extracted from video, using a three-stream network with spatio-temporal and temporal convolutions, for fine-grained action classification in sport. It is applied to the TTStroke-21 dataset, which consists of untrimmed videos of table tennis games. The goal is to detect and classify table tennis strokes in the videos, the first step of a larger scheme that aims to give players feedback for improving their performance. The three modalities are the raw RGB data, the computed optical flow, and the estimated pose of the player. The network consists of three branches with attention blocks. Features are fused at the last stage of the network using bilinear layers. Compared to previous approaches, the use of three modalities allows faster convergence and better performance on both tasks: classification of strokes with known temporal boundaries, and joint segmentation and classification. The pose is also investigated further in order to offer richer feedback to the athletes.
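As a rough illustration of late fusion with bilinear layers, the sketch below combines three per-branch feature vectors into class scores. It is not the authors' implementation: the pairwise cascade (RGB with flow, then the result with pose), the feature sizes, the random weights, and the helper names `bilinear`, `init_bilinear` and `fuse_three_streams` are all assumptions made for the example. The class count of 21 is only inferred from the dataset name. The attention blocks and the convolutional branches are omitted.

```python
import random


def bilinear(x1, x2, weight, bias):
    """Bilinear map: y[k] = sum_i sum_j x1[i] * weight[k][i][j] * x2[j] + bias[k]."""
    out = []
    for w_k, b_k in zip(weight, bias):
        total = b_k
        for a, row in zip(x1, w_k):
            total += a * sum(w * b for w, b in zip(row, x2))
        out.append(total)
    return out


def init_bilinear(d1, d2, d_out, rng):
    """Hypothetical helper: small random weights, zero bias."""
    weight = [[[rng.uniform(-0.1, 0.1) for _ in range(d2)]
               for _ in range(d1)]
              for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def fuse_three_streams(rgb, flow, pose, params):
    """Assumed cascade: fuse RGB with flow, then fuse that result with pose."""
    (w_a, b_a), (w_b, b_b) = params
    rgb_flow = bilinear(rgb, flow, w_a, b_a)
    return bilinear(rgb_flow, pose, w_b, b_b)


rng = random.Random(0)
feat_dim, hidden_dim, num_classes = 4, 3, 21  # illustrative sizes only

params = (
    init_bilinear(feat_dim, feat_dim, hidden_dim, rng),
    init_bilinear(hidden_dim, feat_dim, num_classes, rng),
)

# Stand-ins for the feature vectors produced by the three branches.
rgb = [rng.random() for _ in range(feat_dim)]
flow = [rng.random() for _ in range(feat_dim)]
pose = [rng.random() for _ in range(feat_dim)]

scores = fuse_three_streams(rgb, flow, pose, params)
predicted_class = max(range(num_classes), key=lambda k: scores[k])
```

Unlike concatenation followed by a linear layer, a bilinear layer multiplies every pair of features from its two inputs, so the fused output can model interactions between modalities directly.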
