Paper information - Action Recognition using Deep Convolutional Neural Networks and Compressed Spatio-Temporal Pose Encodings

Action Recognition using Deep Convolutional Neural Networks and Compressed Spatio-Temporal Pose Encodings

Convolutional neural networks have recently shown proficiency at recognizing actions in RGB video. Existing models are generally very deep, requiring large amounts of data to train effectively. Moreover, they rely mainly on global appearance and could potentially underperform in single-environment applications, such as a sports event. To overcome these limitations, we propose to shortcut spatial learning by leveraging the activations within a human pose estimation network. The proposed framework integrates a human pose estimation network with a convolutional classifier via compressed encodings of pose activations. When evaluated on UTD-MHAD, a 27-class multimodal dataset, the pose-based RGB action recognition model achieves a classification accuracy of 98.4% in a subject-specific experiment and outperforms a baseline method that fuses depth and inertial sensor data.
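As a rough illustration of what a "compressed encoding of pose activations" could look like, the sketch below collapses each per-joint heatmap into two 1D profiles and stacks them over time, giving one small spatio-temporal image per joint that a convolutional classifier could consume. The abstract does not specify the compression scheme, so the max-projection used here, along with the names `compress_heatmap` and `encode_clip` and the toy data, are assumptions made purely for illustration and not the authors' method.

```python
from typing import List

Heatmap = List[List[float]]  # H rows x W columns of activations for one joint


def compress_heatmap(heatmap: Heatmap) -> List[float]:
    """Max-project one H x W heatmap onto its column and row axes (W + H values).

    Assumed compression scheme, chosen only for illustration.
    """
    col_profile = [max(col) for col in zip(*heatmap)]  # length W
    row_profile = [max(row) for row in heatmap]        # length H
    return col_profile + row_profile


def encode_clip(clip: List[List[Heatmap]]) -> List[List[List[float]]]:
    """Turn clip[t][j] heatmaps into one T x (W + H) image per joint j."""
    num_joints = len(clip[0])
    return [
        [compress_heatmap(frame[j]) for frame in clip]
        for j in range(num_joints)
    ]


# Toy clip: 2 frames, 1 joint, 2 x 3 heatmaps with a peak moving one column right.
clip = [
    [[[0.0, 0.9, 0.1], [0.0, 0.2, 0.0]]],
    [[[0.0, 0.1, 0.8], [0.0, 0.0, 0.3]]],
]
encoding = encode_clip(clip)
```

In this toy case each row of `encoding[0]` describes one frame, so the shift of the peak between rows is the kind of motion cue a downstream classifier would have to pick up on.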

[1] Trevor Darrell, et al. Long-term recurrent convolutional networks for visual recognition and description, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[2] Nasser Kehtarnavaz, et al. UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[3] Ming Yang, et al. 3D Convolutional Neural Networks for Human Action Recognition, 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[4] Jia Deng, et al. Stacked Hourglass Networks for Human Pose Estimation, 2016, ECCV.

[5] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[6] Yoshua Bengio, et al. Understanding the difficulty of training deep feedforward neural networks, 2010, AISTATS.

[7] Nitish Srivastava, et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[8] Andrew Zisserman, et al. Two-Stream Convolutional Networks for Action Recognition in Videos, 2014, NIPS.

[9] Yaser Sheikh, et al. OpenPose: Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields, 2018, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[10] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11] Nasser Kehtarnavaz, et al. A Real-Time Human Action Recognition System Using Depth and Inertial Sensor Fusion, 2016, IEEE Sensors Journal.

[12] Kaiming He, et al. Feature Pyramid Networks for Object Detection, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Hans-Peter Seidel, et al. VNect, 2017, ACM Trans. Graph.

[14] Bill Triggs, et al. Histograms of oriented gradients for human detection, 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[15] Gang Yu, et al. Cascaded Pyramid Network for Multi-person Pose Estimation, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.