论文信息 - SequentialPointNet: A strong parallelized point cloud sequence network for 3D action recognition

SequentialPointNet: A strong parallelized point cloud sequence network for 3D action recognition

Point cloud sequences of 3D human actions exhibit unordered intra-frame spatial information and ordered interframe temporal information. In order to capture the spatiotemporal structures of the point cloud sequences, cross-frame spatio-temporal local neighborhoods around the centroids are usually constructed. However, the computationally expensive construction procedure of spatio-temporal local neighborhoods severely limits the parallelism of models. Moreover, it is unreasonable to treat spatial and temporal information equally in spatio-temporal local learning, because human actions are complicated along the spatial dimensions and simple along the temporal dimension. In this paper, to avoid spatio-temporal local encoding, we propose a strong parallelized point cloud sequence network referred to as SequentialPointNet for 3D action recognition. SequentialPointNet is composed of two serial modules, i.e., an intra-frame appearance encoding module and an inter-frame motion encoding module. For modeling the strong spatial structures of human actions, each point cloud frame is processed in parallel in the intra-frame appearance encoding module and the feature vector of each frame is output to form a feature vector sequence that characterizes static appearance changes along the temporal dimension. For modeling the weak temporal changes of human actions, in the inter-frame motion encoding module, the temporal position encoding and the hierarchical pyramid pooling strategy are implemented on the feature vector sequence. In addition, in order to better explore spatio-temporal content, multiple level features of human movements are aggregated before performing the end-to-end 3D action recognition. Extensive experiments conducted on three public datasets show that SequentialPointNet outperforms stateof-the-art approaches. SequentialPointNet’s code is available at https://github.com/XingLi1012/SequentialPointNet.git.

[1] Xiaohui Xie,et al. Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[2] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Yi Yang,et al. Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[4] Cordelia Schmid,et al. A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[5] Leonidas J. Guibas,et al. Volumetric and Multi-view CNNs for Object Classification on 3D Data , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6] Wanqing Li,et al. Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[7] Liwei Wang,et al. Learning Relationships for Multi-View 3D Object Recognition , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[8] Anath Fischer,et al. 3D Point Cloud Classification and Segmentation using 3D Modified Fisher Vector Representation for Convolutional Neural Networks , 2017, ArXiv.

[9] Wojciech Zaremba,et al. An Empirical Exploration of Recurrent Network Architectures , 2015, ICML.

[10] Ye Zhang,et al. A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification , 2015, IJCNLP.

[11] Zhenghao Chen,et al. Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Subhransu Maji,et al. 3D Shape Segmentation with Projective Convolutional Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Matthew Korban,et al. DDGCN: A Dynamic Directed Graph Convolutional Network for Action Recognition , 2020, ECCV.

[14] Qi Tian,et al. Symbiotic Graph Neural Networks for 3D Skeleton-Based Human Action Recognition and Motion Prediction , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[15] Yan Zhang,et al. 3D shape segmentation via shape fully convolutional networks , 2017, Comput. Graph..

[16] Gang Wang,et al. NTU RGB+D: A Large Scale Dataset for 3D Human Activity Analysis , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Wei An,et al. Learning Multi-View Representation With LSTM for 3-D Shape Recognition and Retrieval , 2019, IEEE Transactions on Multimedia.

[18] Bingbing Ni,et al. Modeling Point Clouds With Self-Attention and Gumbel Subset Sampling , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[19] Bingbing Ni,et al. 3D Human Action Representation Learning via Cross-View Consistency Pursuit , 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[20] Yue Gao,et al. GVCNN: Group-View Convolutional Neural Networks for 3D Shape Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[21] Dahua Lin,et al. Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, AAAI.

[22] Gang Wang,et al. NTU RGB+D 120: A Large-Scale Benchmark for 3D Human Activity Understanding , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[23] Junsong Yuan,et al. Multi-view Harmonized Bilinear Network for 3D Object Recognition , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24] Gernot Riegler,et al. OctNet: Learning Deep 3D Representations at High Resolutions , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[25] Gang Wang,et al. Skeleton-Based Action Recognition Using Spatio-Temporal LSTM Network with Trust Gates , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[26] Yu Qiao,et al. RPAN: An End-to-End Recurrent Pose-Attention Network for Action Recognition in Videos , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[27] Subhransu Maji,et al. Multi-view Convolutional Neural Networks for 3D Shape Recognition , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[28] Chao Li,et al. Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation , 2018, IJCAI.

[29] Ying Wu,et al. Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[30] Lei Shi,et al. Two-Stream Adaptive Graph Convolutional Networks for Skeleton-Based Action Recognition , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[31] Adrian Sanchez-Caballero,et al. Exploiting the ConvLSTM: Human Action Recognition using Raw Depth Video-Based Recurrent Neural Networks , 2020, ArXiv.

[32] Xilin Chen,et al. An Efficient PointLSTM for Point Clouds Based Gesture Recognition , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[33] Leonidas J. Guibas,et al. SyncSpecCNN: Synchronized Spectral CNN for 3D Shape Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[34] Yang Xiao,et al. Action Recognition for Depth Video using Multi-view Dynamic Images , 2018, Inf. Sci..

[35] Mohan S. Kankanhalli,et al. PSTNet: Point Spatio-Temporal Convolution on Point Cloud Sequences , 2022, ICLR.

[36] Leonidas J. Guibas,et al. CoSegNet: Deep Co-Segmentation of 3D Shapes with Group Consistency Loss , 2019, ArXiv.

[37] Pichao Wang,et al. Depth Pooling Based Large-Scale 3-D Action Recognition With Convolutional Neural Networks , 2018, IEEE Transactions on Multimedia.

[38] Xiaodong Yang,et al. Recognizing actions using depth motion maps-based histograms of oriented gradients , 2012, ACM Multimedia.

[39] Jürgen Schmidhuber,et al. Long Short-Term Memory , 1997, Neural Computation.

[40] Jeannette Bohg,et al. MeteorNet: Deep Learning on Dynamic 3D Point Cloud Sequences , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[41] Jiwen Lu,et al. Structural Relational Reasoning of Point Clouds , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Leonidas J. Guibas,et al. PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space , 2017, NIPS.

[43] Silvio Savarese,et al. 3D Semantic Parsing of Large-Scale Indoor Spaces , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[44] Ye Duan,et al. PointGrid: A Deep Network for 3D Shape Understanding , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[45] Sebastian Scherer,et al. VoxNet: A 3D Convolutional Neural Network for real-time object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[46] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[47] Qinghua Huang,et al. Learning Shape-Motion Representations from Geometric Algebra Spatio-Temporal Model for Skeleton-Based Action Recognition , 2019, 2019 IEEE International Conference on Multimedia and Expo (ICME).

[48] Leonidas J. Guibas,et al. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Chi-Wing Fu,et al. PointWeb: Enhancing Local Neighborhood Features for Point Cloud Processing , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[50] Mario Fernando Montenegro Campos,et al. STOP: Space-Time Occupancy Patterns for 3D Action Recognition from Depth Map Sequences , 2012, CIARP.

[51] In-So Kweon,et al. CBAM: Convolutional Block Attention Module , 2018, ECCV.

[52] Seong-Whan Lee,et al. View-independent human action recognition with Volume Motion Template on single stereo camera , 2010, Pattern Recognit. Lett..

[53] Yoshua Bengio,et al. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling , 2014, ArXiv.

[54] Phil Blunsom,et al. Teaching Machines to Read and Comprehend , 2015, NIPS.

[55] Yingli Tian,et al. Self-supervised 4D Spatio-temporal Feature Learning via Order Prediction of Sequential Point Cloud Clips , 2021, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV).

[56] Marta Marrón Romera,et al. 3DFCNN: real-time action recognition using 3D deep neural networks with raw depth information , 2020, Multimedia Tools and Applications.

[57] Jianxiong Xiao,et al. 3D ShapeNets: A deep representation for volumetric shapes , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[58] Lei Shi,et al. Skeleton-Based Action Recognition With Directed Graph Neural Networks , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Fu Xiong,et al. 3DV: 3D Dynamic Voxel for Action Recognition in Depth Video , 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[60] Shuang Wang,et al. Structured Images for RGB-D Action Recognition , 2017, 2017 IEEE International Conference on Computer Vision Workshops (ICCVW).

[61] Jiwen Lu,et al. 3DCNN-DQN-RNN: A Deep Reinforcement Learning Framework for Semantic Parsing of Large-Scale 3D Point Clouds , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[62] Jing Zhang,et al. Action Recognition From Depth Maps Using Deep Convolutional Neural Networks , 2016, IEEE Transactions on Human-Machine Systems.

[63] Ajmal Mian,et al. 3D Action Recognition from Novel Viewpoints , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[64] Xiaoming Liu,et al. On Geometric Features for Skeleton-Based Action Recognition Using Multilayer LSTM Networks , 2017, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[65] M. Harville,et al. Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[66] Ron Kimmel,et al. Momen^et: Flavor the Moments in Learning to Classify Shapes , 2018, 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).

[67] Feng Lu,et al. VoxSegNet: Volumetric CNNs for Semantic Part Segmentation of 3D Shapes , 2018, IEEE Transactions on Visualization and Computer Graphics.