Action Recognition with Joints-Pooled 3D Deep Convolutional Descriptors

Torso joints can be regarded as landmarks of the human body. An action consists of a series of body poses, which are determined by the positions of the joints. With the rapid development of RGB-D cameras and pose estimation research, acquiring body joints has become much easier than before. We therefore propose to combine joint positions with currently popular deep-learned features for action recognition. In this paper, we present a simple yet effective method that aggregates the convolutional activations of a 3D deep convolutional neural network (3D CNN) into discriminative descriptors based on joint positions. Two pooling schemes for mapping body joints onto convolutional feature maps are discussed. The resulting joints-pooled 3D deep convolutional descriptors (JDDs) are more effective and robust than the original 3D CNN features and other competing features. We evaluate the proposed descriptors on recognizing both short actions and complex activities. Experimental results on real-world datasets show that our method significantly outperforms state-of-the-art methods.
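To make the idea of pooling convolutional activations at joint positions concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the two pooling schemes, so the mapping used here (scaling pixel coordinates by the ratio of feature-map size to frame size, and taking the joints from the centre frame of each temporal slice) is an assumption. The function name `joints_pooled_descriptor`, the array layouts, and the toy sizes are all hypothetical.

```python
import numpy as np


def joints_pooled_descriptor(feature_maps, joints, frame_size):
    """Pool 3D CNN activations at body-joint locations (illustrative sketch).

    feature_maps: array of shape (C, T, H, W), activations of one conv layer.
    joints:       array of shape (L, J, 2), (x, y) pixel positions of J joints
                  in each of the L input video frames.
    frame_size:   (height, width) of the input frames in pixels.
    Returns a 1-D descriptor of length T * J * C.
    """
    C, T, H, W = feature_maps.shape
    L, J, _ = joints.shape
    frame_h, frame_w = frame_size

    # Assumption: each temporal slice uses the joints of the centre frame
    # of the span of input frames it covers.
    frame_idx = np.clip(((np.arange(T) + 0.5) * L / T).astype(int), 0, L - 1)
    pts = joints[frame_idx]  # (T, J, 2)

    # Assumption: map pixel coordinates to feature-map cells by the ratio
    # of feature-map size to frame size, clipped to valid indices.
    cols = np.clip(np.floor(pts[..., 0] * W / frame_w).astype(int), 0, W - 1)
    rows = np.clip(np.floor(pts[..., 1] * H / frame_h).astype(int), 0, H - 1)

    # Gather the C-dimensional activation vector at every (slice, joint).
    t_idx = np.arange(T)[:, None]                 # (T, 1), broadcasts to (T, J)
    pooled = feature_maps[:, t_idx, rows, cols]   # (C, T, J)

    # Concatenate over time slices and joints into a single descriptor.
    return pooled.transpose(1, 2, 0).reshape(-1)


# Toy example with hypothetical sizes: 4 channels, 2 temporal slices,
# a 7x7 feature map, 16 input frames of 112x112 pixels, 3 joints.
rng = np.random.default_rng(0)
fmap = rng.standard_normal((4, 2, 7, 7))
jts = rng.uniform(0, 112, size=(16, 3, 2))
desc = joints_pooled_descriptor(fmap, jts, (112, 112))
print(desc.shape)  # (24,)
```

Concatenating one activation vector per joint and per temporal slice keeps the descriptor length fixed for a given network layer and skeleton, so it can be fed directly to a linear classifier.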
