Paper Information - Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation

Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation

Skeleton-based human action recognition has recently drawn increasing attention with the availability of large-scale skeleton datasets. Two factors are most crucial for this task: the intra-frame representation of joint co-occurrences and the inter-frame representation of the skeleton's temporal evolution. In this paper, we propose an end-to-end convolutional co-occurrence feature learning framework. The co-occurrence features are learned hierarchically, with different levels of contextual information aggregated gradually. First, point-level information of each joint is encoded independently. These encodings are then assembled into semantic representations in both the spatial and temporal domains. Specifically, we introduce a global spatial aggregation scheme, which learns joint co-occurrence features superior to those obtained by local aggregation. In addition, raw skeleton coordinates and their temporal differences are integrated in a two-stream paradigm. Experiments show that our approach consistently outperforms other state-of-the-art methods on action recognition and detection benchmarks such as NTU RGB+D, SBU Kinect Interaction, and PKU-MMD.
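The pipeline described in the abstract (point-level encoding, global spatial aggregation, and a two-stream input of coordinates plus temporal differences) can be illustrated with a small NumPy sketch. This is not the authors' implementation: plain linear maps with ReLU stand in for the convolutional layers, temporal aggregation is omitted, and all function names, weight shapes, and sizes (`T`, `N`, `C`, `D`, `K`) are assumptions chosen for illustration.

```python
import numpy as np


def temporal_difference(skeleton):
    """Motion stream: frame-to-frame coordinate differences.

    skeleton has shape (T, N, C): frames, joints, coordinates.
    The first frame is differenced with itself (an assumption made here
    so that the output keeps the same shape as the input).
    """
    return np.diff(skeleton, axis=0, prepend=skeleton[:1])


def point_level_encode(x, weight):
    """Encode each joint independently with a shared linear map + ReLU.

    x: (T, N, C), weight: (C, D) -> (T, N, D). No joint sees another joint.
    """
    return np.maximum(x @ weight, 0.0)


def global_spatial_aggregate(features, weight):
    """Mix information from all joints of a frame at once.

    features: (T, N, D), weight: (N * D, K) -> (T, K).
    Every output unit depends on every joint, in contrast to a local
    scheme that would only combine a few neighbouring joints.
    """
    num_frames = features.shape[0]
    return np.maximum(features.reshape(num_frames, -1) @ weight, 0.0)


def two_stream_features(skeleton, params):
    """Run both streams and concatenate their per-frame features."""
    streams = {"position": skeleton, "motion": temporal_difference(skeleton)}
    outputs = []
    for name, x in streams.items():
        encoded = point_level_encode(x, params[name]["point"])
        outputs.append(global_spatial_aggregate(encoded, params[name]["global"]))
    return np.concatenate(outputs, axis=1)  # (T, 2 * K)


# Usage with random data and random (untrained) weights.
rng = np.random.default_rng(0)
T, N, C, D, K = 8, 25, 3, 4, 16  # hypothetical sizes
skeleton = rng.normal(size=(T, N, C))
params = {
    name: {
        "point": rng.normal(size=(C, D)),
        "global": rng.normal(size=(N * D, K)),
    }
    for name in ("position", "motion")
}
features = two_stream_features(skeleton, params)
print(features.shape)  # (8, 32)
```

Flattening the joint axis before the second linear map is one simple way to give every output unit access to all joints, which is the idea behind global aggregation. A trained model would learn these weights end to end and add temporal layers on top of the per-frame features.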
