SequentialPointNet: A strong parallelized point cloud sequence network for 3D action recognition

Point cloud sequences of 3D human actions exhibit unordered intra-frame spatial information and ordered interframe temporal information. In order to capture the spatiotemporal structures of the point cloud sequences, cross-frame spatio-temporal local neighborhoods around the centroids are usually constructed. However, the computationally expensive construction procedure of spatio-temporal local neighborhoods severely limits the parallelism of models. Moreover, it is unreasonable to treat spatial and temporal information equally in spatio-temporal local learning, because human actions are complicated along the spatial dimensions and simple along the temporal dimension. In this paper, to avoid spatio-temporal local encoding, we propose a strong parallelized point cloud sequence network referred to as SequentialPointNet for 3D action recognition. SequentialPointNet is composed of two serial modules, i.e., an intra-frame appearance encoding module and an inter-frame motion encoding module. For modeling the strong spatial structures of human actions, each point cloud frame is processed in parallel in the intra-frame appearance encoding module and the feature vector of each frame is output to form a feature vector sequence that characterizes static appearance changes along the temporal dimension. For modeling the weak temporal changes of human actions, in the inter-frame motion encoding module, the temporal position encoding and the hierarchical pyramid pooling strategy are implemented on the feature vector sequence. In addition, in order to better explore spatio-temporal content, multiple level features of human movements are aggregated before performing the end-to-end 3D action recognition. Extensive experiments conducted on three public datasets show that SequentialPointNet outperforms stateof-the-art approaches. SequentialPointNet’s code is available at https://github.com/XingLi1012/SequentialPointNet.git.

[1]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[2]  Trevor Darrell,et al.  Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3]  Yi Yang,et al.  Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[5]  Leonidas J. Guibas,et al.  Volumetric and Multi-view CNNs for Object Classification on 3D Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[7]  Liwei Wang,et al.  Learning Relationships for Multi-View 3D Object Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8]  Anath Fischer,et al.  3D Point Cloud Classification and Segmentation using 3D Modified Fisher Vector Representation for Convolutional Neural Networks , 2017, ArXiv.

[9]  Wojciech Zaremba,et al.  An Empirical Exploration of Recurrent Network Architectures , 2015, ICML.

[10]  Ye Zhang,et al.  A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification , 2015, IJCNLP.

[11]  Zhenghao Chen,et al.  Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Subhransu Maji,et al.  3D Shape Segmentation with Projective Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Matthew Korban,et al.  DDGCN: A Dynamic Directed Graph Convolutional Network for Action Recognition , 2020, ECCV.

[14]  Qi Tian,et al.  Symbiotic Graph Neural Networks for 3D Skeleton-Based Human Action Recognition and Motion Prediction , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15]  Yan Zhang,et al.  3D shape segmentation via shape fully convolutional networks , 2017, Comput. Graph..

[16]  Gang Wang,et al.  NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Wei An,et al.  Learning Multi-View Representation With LSTM for 3-D Shape Recognition and Retrieval , 2019, IEEE Transactions on Multimedia.

[18]  Bingbing Ni,et al.  Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19]  Bingbing Ni,et al.  3D Human Action Representation Learning via Cross-View Consistency Pursuit , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Yue Gao,et al.  GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21]  Dahua Lin,et al.  Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[22]  Gang Wang,et al.  NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23]  Junsong Yuan,et al.  Multi-view Harmonized Bilinear Network for 3D Object Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Gernot Riegler,et al.  OctNet: Learning Deep 3D Representations at High Resolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25]  Gang Wang,et al.  Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26]  Yu Qiao,et al.  RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27]  Subhransu Maji,et al.  Multi-view Convolutional Neural Networks for 3D Shape Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28]  Chao Li,et al.  Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation , 2018, IJCAI.

[29]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30]  Lei Shi,et al.  Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31]  Adrian Sanchez-Caballero,et al.  Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based Recurrent Neural Networks , 2020, ArXiv.

[32]  Xilin Chen,et al.  An Efficient PointLSTM for Point Clouds Based Gesture Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33]  Leonidas J. Guibas,et al.  SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34]  Yang Xiao,et al.  Action Recognition for Depth Video using Multi-view Dynamic Images , 2018, Inf. Sci..

[35]  Mohan S. Kankanhalli,et al.  PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences , 2022, ICLR.

[36]  Leonidas J. Guibas,et al.  CoSegNet: Deep Co-Segmentation of 3D Shapes with Group Consistency Loss , 2019, ArXiv.

[37]  Pichao Wang,et al.  Depth Pooling Based Large-Scale 3-D Action Recognition With Convolutional Neural Networks , 2018, IEEE Transactions on Multimedia.

[38]  Xiaodong Yang,et al.  Recognizing actions using depth motion maps-based histograms of oriented gradients , 2012, ACM Multimedia.

[39]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[40]  Jeannette Bohg,et al.  MeteorNet: Deep Learning on Dynamic 3D Point Cloud Sequences , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41]  Jiwen Lu,et al.  Structural Relational Reasoning of Point Clouds , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Leonidas J. Guibas,et al.  PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[43]  Silvio Savarese,et al.  3D Semantic Parsing of Large-Scale Indoor Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44]  Ye Duan,et al.  PointGrid: A Deep Network for 3D Shape Understanding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45]  Sebastian Scherer,et al.  VoxNet: A 3D Convolutional Neural Network for real-time object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[46]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[47]  Qinghua Huang,et al.  Learning Shape-Motion Representations from Geometric Algebra Spatio-Temporal Model for Skeleton-Based Action Recognition , 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[48]  Leonidas J. Guibas,et al.  PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Chi-Wing Fu,et al.  PointWeb: Enhancing Local Neighborhood Features for Point Cloud Processing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50]  Mario Fernando Montenegro Campos,et al.  STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences , 2012, CIARP.

[51]  In-So Kweon,et al.  CBAM: Convolutional Block Attention Module , 2018, ECCV.

[52]  Seong-Whan Lee,et al.  View-independent human action recognition with Volume Motion Template on single stereo camera , 2010, Pattern Recognit. Lett..

[53]  Yoshua Bengio,et al.  Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[54]  Phil Blunsom,et al.  Teaching Machines to Read and Comprehend , 2015, NIPS.

[55]  Yingli Tian,et al.  Self-supervised 4D Spatio-temporal Feature Learning via Order Prediction of Sequential Point Cloud Clips , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[56]  Marta Marrón Romera,et al.  3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information , 2020, Multimedia Tools and Applications.

[57]  Jianxiong Xiao,et al.  3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58]  Lei Shi,et al.  Skeleton-Based Action Recognition With Directed Graph Neural Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59]  Fu Xiong,et al.  3DV: 3D Dynamic Voxel for Action Recognition in Depth Video , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Shuang Wang,et al.  Structured Images for RGB-D Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[61]  Jiwen Lu,et al.  3DCNN-DQN-RNN: A Deep Reinforcement Learning Framework for Semantic Parsing of Large-Scale 3D Point Clouds , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62]  Jing Zhang,et al.  Action Recognition From Depth Maps Using Deep Convolutional Neural Networks , 2016, IEEE Transactions on Human-Machine Systems.

[63]  Ajmal Mian,et al.  3D Action Recognition from Novel Viewpoints , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64]  Xiaoming Liu,et al.  On Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[65]  M. Harville,et al.  Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[66]  Ron Kimmel,et al.  Momen^et: Flavor the Moments in Learning to Classify Shapes , 2018, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[67]  Feng Lu,et al.  VoxSegNet: Volumetric CNNs for Semantic Part Segmentation of 3D Shapes , 2018, IEEE Transactions on Visualization and Computer Graphics.