Advanced skeleton-based action recognition via spatial-temporal rotation descriptors

As human action is a spatial–temporal process, modern action recognition research has focused on exploring more effective motion representations, rather than only taking human poses as input. To better model a motion pattern, in this paper, we exploit the rotation information to depict the spatial–temporal variation, thus enhancing the dynamic appearance, as well as forming a complementary component with the static coordinates of the joints. Specifically, we design to represent the movement of human body with joint units, consisting of performing regrouping human joints together with the adjacent two bones. Therefore, the rotation descriptors reduce the impact from the static values while focus on the dynamic movement. The proposed general features can be simply applied to existing CNN-based action recognition methods. The experimental results performed on NTU-RGB+D and ICL First Person Handpose datasets demonstrate the advantages of the proposed method.

[1]  Bolei Zhou,et al.  Temporal Relational Reasoning in Videos , 2017, ECCV.

[2]  Austin Reiter,et al.  Interpretable 3D Human Action Analysis with Temporal Convolutional Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[3]  Hui Li,et al.  DenseFuse: A Fusion Approach to Infrared and Visible Images , 2018, IEEE Transactions on Image Processing.

[4]  Cordelia Schmid,et al.  Action Recognition with Improved Trajectories , 2013, 2013 IEEE International Conference on Computer Vision.

[5]  Yutaka Satoh,et al.  Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[6]  Andrew Zisserman,et al.  Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[7]  Luc Van Gool,et al.  Deep Learning on Lie Groups for Skeleton-Based Action Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Rama Chellappa,et al.  Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Kuldeep Singh,et al.  Robust object tracking with crow search optimized multi-cue particle filter , 2019, Pattern Analysis and Applications.

[10]  Josef Kittler,et al.  Joint Group Feature Selection and Discriminative Filter Learning for Robust Visual Object Tracking , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[11]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[12]  J. Jost Riemannian geometry and geometric analysis , 1995 .

[13]  Shehroz S. Khan,et al.  Spatio-temporal adversarial learning for detecting unseen falls , 2019, Pattern Analysis and Applications.

[14]  Chen Sun,et al.  Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification , 2017, ECCV.

[15]  Yu Qiao,et al.  RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[16]  Xu Chen,et al.  Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[18]  Xudong Lin,et al.  DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[20]  Tinne Tuytelaars,et al.  Rank Pooling for Action Recognition , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[21]  Wei Wu,et al.  STM: SpatioTemporal and Motion Encoding for Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[22]  Josef Kittler,et al.  Spatial Residual Layer and Dense Connection Block Enhanced Spatial Temporal Graph Convolutional Network for Skeleton-Based Action Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[23]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[24]  Ni Tao,et al.  Simultaneous identification of points and circles: structure from motion system in industry scenes , 2020 .

[25]  Tieniu Tan,et al.  Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning , 2018, ECCV.

[26]  Jinhui Tang,et al.  Unsupervised Feature Selection via Nonnegative Spectral Analysis and Redundancy Control , 2015, IEEE Transactions on Image Processing.

[27]  Tao Mei,et al.  Deep Collaborative Embedding for Social Image Understanding , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[28]  Shanxin Yuan,et al.  First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[29]  Wei Zhang,et al.  Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Richard P. Wildes,et al.  Spatiotemporal Residual Networks for Video Action Recognition , 2016, NIPS.

[32]  Cordelia Schmid,et al.  DeepFlow: Large Displacement Optical Flow with Deep Matching , 2013, 2013 IEEE International Conference on Computer Vision.

[33]  Yann LeCun,et al.  A Closer Look at Spatiotemporal Convolutions for Action Recognition , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[34]  Jinhui Tang,et al.  Robust Structured Nonnegative Matrix Factorization for Image Representation , 2018, IEEE Transactions on Neural Networks and Learning Systems.

[35]  Zhenghao Chen,et al.  Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Marc Dymetman Group Theory and Computational Linguistics , 1998, J. Log. Lang. Inf..

[37]  Yaser Sheikh,et al.  OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields , 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Lei Shi,et al.  Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[39]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[40]  Jian-Huang Lai,et al.  Jointly Learning Heterogeneous Features for RGB-D Activity Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.