Paper information - Human Pose Estimation in Space and Time Using 3D CNN

This paper explores the ability of convolutional neural networks to perform a task that humans manage easily: perceiving the 3D pose of a human body from varying viewing angles. Our approach is restricted to a monocular vision system. We apply a convolutional neural network to RGB videos and extend it to three-dimensional convolutions by encoding the time dimension of the video as the third dimension of the convolution, then regress directly to human body joint positions in 3D coordinate space. The network achieves state-of-the-art performance on the Human3.6M dataset, demonstrating that temporal data can be represented successfully as an additional dimension in the convolution operation.
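To make the idea of treating time as a third convolutional dimension concrete, here is a minimal pure-Python sketch, not the authors' network. It slides one kernel over a toy single-channel clip indexed as (time, height, width), then maps the resulting features to 3D joint coordinates with a linear head. The abstract does not give layer counts, kernel sizes, or input resolution, so the clip size, the kernel, the weights, and the helper names `conv3d_valid` and `regress_joints` are all illustrative assumptions.

```python
def conv3d_valid(clip, kernel):
    """Stride-1, no-padding 3D cross-correlation over a (T, H, W) clip."""
    T, H, W = len(clip), len(clip[0]), len(clip[0][0])
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    out = []
    for t in range(T - kt + 1):
        plane = []
        for y in range(H - kh + 1):
            row = []
            for x in range(W - kw + 1):
                acc = 0.0
                # The kernel spans kt consecutive frames, so one output
                # value mixes information across time as well as space.
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def regress_joints(features, weights, num_joints):
    """Hypothetical linear head: flatten features, emit (x, y, z) per joint."""
    flat = [v for plane in features for row in plane for v in row]
    coords = [sum(w * v for w, v in zip(w_row, flat)) for w_row in weights]
    return [tuple(coords[3 * j:3 * j + 3]) for j in range(num_joints)]


# Toy clip (assumption): 4 frames of 5x5 pixels, pixel value = frame index.
clip = [[[float(t)] * 5 for _ in range(5)] for t in range(4)]
# A 2x1x1 temporal-difference kernel: responds to change between frames.
kernel = [[[-1.0]], [[1.0]]]
features = conv3d_valid(clip, kernel)
print(len(features), len(features[0]), len(features[0][0]))  # 3 5 5

# Arbitrary illustrative weights for 2 joints (6 outputs over 75 features).
num_joints = 2
n_feat = 3 * 5 * 5
weights = [[(i + 1) / n_feat] * n_feat for i in range(3 * num_joints)]
joints = regress_joints(features, weights, num_joints)
print(joints)
```

A practical network would stack several such layers over RGB channels with learned kernels. The sketch only shows that the temporal axis is convolved in the same way as the two spatial axes, and that the output is regressed directly as coordinates rather than as per-joint heatmaps.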
