Optimized Skeleton-based Action Recognition via Sparsified Graph Regression

With the prevalence of accessible depth sensors, dynamic human body skeletons have attracted much attention as a robust modality for action recognition. Previous methods model skeletons based on RNN or CNN, which has limited expressive power for irregular skeleton joints. While graph convolutional networks (GCN) have been proposed to address irregular graph-structured data, the fundamental graph construction remains challenging. In this paper, we represent skeletons naturally on graphs, and propose a graph regression based GCN (GR-GCN) for skeleton-based action recognition, aiming to capture the spatio-temporal variation in the data. As the graph representation is crucial to graph convolution, we first propose graph regression to statistically learn the underlying graph from multiple observations. In particular, we provide spatio-temporal modeling of skeletons and pose an optimization problem on the graph structure over consecutive frames, which enforces the sparsity of the underlying graph for efficient representation. The optimized graph not only connects each joint to its neighboring joints in the same frame strongly or weakly, but also links with relevant joints in the previous and subsequent frames. We then feed the optimized graph into the GCN along with the coordinates of the skeleton sequence for feature learning, where we deploy high-order and fast Chebyshev approximation of spectral graph convolution. Further, we provide analysis of the variation characterization by the Chebyshev approximation. Experimental results validate the effectiveness of the proposed graph regression and show that the proposed GR-GCN achieves the state-of-the-art performance on the widely used NTU RGB+D, UT-Kinect and SYSU 3D datasets.

[1]  Wei Liu,et al.  Discriminative Multi-instance Multitask Learning for 3D Action Recognition , 2017, IEEE Transactions on Multimedia.

[2]  Marwan Torki,et al.  Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations , 2013, IJCAI.

[3]  Christopher D. Manning,et al.  Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks , 2015, ACL.

[4]  Guillermo Garcia-Hernando,et al.  Transition Forests: Learning Discriminative Temporal Transitions for Action Recognition and Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[6]  Limin Wang,et al.  Temporal Action Detection with Structured Segment Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7]  Hong Liu,et al.  Enhanced skeleton visualization for view invariant human action recognition , 2017, Pattern Recognit..

[8]  Yansong Tang,et al.  Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[9]  Hassen Drira,et al.  Coding Kendall's Shape Trajectories for 3D Action Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[10]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  P. Vandergheynst,et al.  Accelerated filtering on graphs using Lanczos method , 2015, 1509.04537.

[12]  Shuicheng Yan,et al.  Semantic Object Parsing with Graph LSTM , 2016, ECCV.

[13]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[14]  Alberto Del Bimbo,et al.  Recognizing Actions from Depth Cameras as Weakly Aligned Multi-part Bag-of-Poses , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[15]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[16]  Shuang Wang,et al.  Skeleton-based action recognition using LSTM and CNN , 2017, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW).

[17]  Joan Bruna,et al.  Deep Convolutional Networks on Graph-Structured Data , 2015, ArXiv.

[18]  Leon W. Cohen,et al.  Conference Board of the Mathematical Sciences , 1963 .

[19]  Pichao Wang,et al.  Joint Distance Maps Based Action Recognition With Convolutional Neural Networks , 2017, IEEE Signal Processing Letters.

[20]  Oscar C. Au,et al.  Multiresolution Graph Fourier Transform for Compression of Piecewise Smooth Images , 2015, IEEE Transactions on Image Processing.

[21]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[22]  Mathias Niepert,et al.  Learning Convolutional Neural Networks for Graphs , 2016, ICML.

[23]  Joan Bruna,et al.  Spectral Networks and Locally Connected Networks on Graphs , 2013, ICLR.

[24]  Amin Zheng,et al.  RGCNN: Regularized Graph CNN for Point Cloud Segmentation , 2018, ACM Multimedia.

[25]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[26]  Bing Li,et al.  Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition , 2016, ECCV.

[27]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[28]  Sanghoon Lee,et al.  Ensemble Deep Learning for Skeleton-Based Action Recognition Using Temporal Sliding LSTM Networks , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[29]  Pierre Vandergheynst,et al.  Wavelets on Graphs via Spectral Graph Theory , 2009, ArXiv.

[30]  Austin Reiter,et al.  Interpretable 3D Human Action Analysis with Temporal Convolutional Networks , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[31]  Gang Wang,et al.  Real-Time RGB-D Activity Prediction by Soft Regression , 2016, ECCV.

[32]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Yong Du,et al.  Skeleton based action recognition with convolutional neural network , 2015, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR).

[34]  Richard S. Zemel,et al.  Gated Graph Sequence Neural Networks , 2015, ICLR.

[35]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[36]  Rushil Anirudh,et al.  Elastic functional coding of human actions: From vector-fields to latent variables , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[37]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[38]  Alán Aspuru-Guzik,et al.  Convolutional Networks on Graphs for Learning Molecular Fingerprints , 2015, NIPS.

[39]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[40]  BlakeAndrew,et al.  Real-time human pose recognition in parts from single depth images , 2013 .

[41]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[42]  Mari Ostendorf,et al.  Conversation Modeling on Reddit Using a Graph-Structured LSTM , 2017, TACL.

[43]  Marwan Torki,et al.  Histogram of Oriented Displacements (HOD): Describing Trajectories of Human Joints for Action Recognition , 2013, IJCAI.

[44]  Jaejoon Lee,et al.  Edge-adaptive transforms for efficient depth map coding , 2010, 28th Picture Coding Symposium.

[45]  U. Feige,et al.  Spectral Graph Theory , 2015 .

[46]  Luc Van Gool,et al.  Temporal Segment Networks: Towards Good Practices for Deep Action Recognition , 2016, ECCV.

[47]  Yue Zhang,et al.  Sentence-State LSTM for Text Representation , 2018, ACL.

[48]  Xiaoming Liu,et al.  On Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[49]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[50]  Fei Wu,et al.  Spatio-Temporal Graph Routing for Skeleton-Based Action Recognition , 2019, AAAI.

[51]  Fei Han,et al.  Space-Time Representation of People Based on 3D Skeletal Data: A Review , 2016, Comput. Vis. Image Underst..

[52]  Alan L. Yuille,et al.  Mining 3D Key-Pose-Motifs for Action Recognition , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[53]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[54]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[55]  Xavier Bresson,et al.  Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering , 2016, NIPS.

[56]  Yue Wang,et al.  Dynamic Graph CNN for Learning on Point Clouds , 2018, ACM Trans. Graph..

[57]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[58]  Jian Yang,et al.  Spatio-Temporal Graph Convolution for Skeleton Based Action Recognition , 2018, AAAI.

[59]  Jian-Huang Lai,et al.  Jointly Learning Heterogeneous Features for RGB-D Activity Recognition , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Mohammed Bennamoun,et al.  A New Representation of Skeleton Sequences for 3D Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[61]  Nanyun Peng,et al.  Cross-Sentence N-ary Relation Extraction with Graph LSTMs , 2017, TACL.

[62]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[63]  Lorenzo Torresani,et al.  Learning Spatiotemporal Features with 3D Convolutional Networks , 2014, 2015 IEEE International Conference on Computer Vision (ICCV).

[64]  F. Scarselli,et al.  A new model for learning in graph domains , 2005, Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005..

[65]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[66]  Tieniu Tan,et al.  Skeleton-Based Action Recognition with Spatial Reasoning and Temporal Stack Learning , 2018, ECCV.

[67]  Jake K. Aggarwal,et al.  View invariant human action recognition using histograms of 3D joints , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[68]  Junsong Yuan,et al.  Spatio-Temporal Naive-Bayes Nearest-Neighbor (ST-NBNN) for Skeleton-Based Action Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Pascal Frossard,et al.  The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains , 2012, IEEE Signal Processing Magazine.

[70]  Rama Chellappa,et al.  Rolling Rotations for Recognizing Human Actions from 3D Skeletal Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).