SGM-Net: Skeleton-guided multimodal network for action recognition

Abstract Single-modality human action recognition from RGB or skeleton data has been extensively studied. Each modality has its own advantages and limitations, because the two depict actions from different perspectives, and their features can complement each other when describing an action. It is therefore worthwhile to fuse the two modalities and exploit this complementarity for action recognition. However, existing multimodal methods fail to fully exploit the complementarity of the RGB and skeleton modalities. In this paper, we propose a Skeleton-Guided Multimodal Network (SGM-Net) for human action recognition. The proposed method makes full use of the complementarity of the two modalities at the semantic feature level. Technically, we introduce a guided block, the key component of SGM-Net. It lets skeleton features guide RGB features, so that the RGB information most strongly related to the action is enhanced. Within the guided block, we explore two schemes for the correlation operation. We perform a series of ablation experiments to verify the effectiveness of the guided block. The experimental results show that our approach achieves state-of-the-art performance on the NTU and Sub-JHMDB datasets.
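The abstract does not specify how the guided block is built, so the following is only an illustrative sketch of the general idea of skeleton features re-weighting RGB features. Everything in it is an assumption and not the authors' design: the function name `guided_block`, the projection matrix `proj`, the residual connection, and the two example correlation schemes (a channel-wise gate and a dot-product spatial attention).

```python
import numpy as np

def guided_block(rgb_feat, skel_feat, proj, scheme="gate"):
    """Hypothetical guided block: skeleton features re-weight RGB features.

    rgb_feat:  (C, H, W) RGB feature map
    skel_feat: (D,) skeleton feature vector
    proj:      (C, D) assumed projection into the RGB channel space
    """
    guide = proj @ skel_feat  # (C,) skeleton guidance per RGB channel
    if scheme == "gate":
        # Assumed scheme 1: sigmoid gate applied channel-wise.
        gate = 1.0 / (1.0 + np.exp(-guide))
        attended = rgb_feat * gate[:, None, None]
    elif scheme == "dot":
        # Assumed scheme 2: dot-product correlation at each spatial
        # location, normalised with a softmax over all H*W positions.
        scores = np.einsum("c,chw->hw", guide, rgb_feat)
        scores = scores - scores.max()  # numerical stability
        weights = np.exp(scores) / np.exp(scores).sum()
        attended = rgb_feat * weights[None, :, :]
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    # Residual connection (an assumption) keeps the original RGB signal.
    return rgb_feat + attended
```

In both variants the output keeps the shape of the RGB feature map, so a block of this kind could in principle be inserted between existing layers of an RGB backbone.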
