3D Human Activity Recognition with Reconfigurable Convolutional Neural Networks

Human activity understanding with 3D/depth sensors has received increasing attention in multimedia processing and interaction. This work develops a novel deep model for automatic activity recognition from RGB-D videos. We represent each human activity as an ensemble of cubic-like video segments, and learn to discover the temporal structure of each activity category, i.e., how activities are decomposed into actions for classification. Our model can be regarded as a structured deep architecture: it extends convolutional neural networks (CNNs) by incorporating structural alternatives. Specifically, we build a network of 3D convolutions and max-pooling operators over the video segments, and introduce latent variables in each convolutional layer that manipulate the activation of neurons. Our model thus advances existing approaches in two respects: (i) it acts directly on the raw inputs (grayscale-depth data) rather than relying on hand-crafted features, and (ii) its structure can be dynamically adjusted to account for the temporal variation of human activities, i.e., the network configuration is allowed to be only partially activated during inference. For model training, we propose an EM-type optimization method that iteratively (i) discovers the latent structure by determining the decomposed actions for each training example, and (ii) learns the network parameters via the back-propagation algorithm. Our approach is validated in challenging scenarios and outperforms state-of-the-art methods. A large human activity database of RGB-D videos is presented in addition.
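To make the architecture concrete, the following is a minimal NumPy sketch of the two ingredients the abstract describes: 3D convolution with max-pooling over cubic video segments, and latent binary variables that gate (partially activate) the per-segment sub-networks. This is an illustrative assumption of how such gating could work, not the authors' implementation; `infer_gates` stands in for the E-step of the EM-type training, which would alternate with back-propagation.

```python
import numpy as np

def conv3d(video, kernel):
    # Valid 3D cross-correlation over a (T, H, W) grayscale or depth clip.
    T, H, W = video.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+t, j:j+h, k:k+w] * kernel)
    return out

def max_pool3d(x, size=2):
    # Non-overlapping 3D max-pooling (truncates ragged borders).
    T, H, W = (d // size for d in x.shape)
    return x[:T*size, :H*size, :W*size].reshape(
        T, size, H, size, W, size).max(axis=(1, 3, 5))

def forward(segment, kernel, gate):
    # gate is a latent binary variable: 0 deactivates this segment's
    # sub-network, realizing a partially activated configuration.
    return gate * max_pool3d(np.maximum(conv3d(segment, kernel), 0))

def infer_gates(segments, kernel, n_active):
    # E-step analogue: choose which temporal segments are active by
    # maximizing the total network response (a hypothetical criterion).
    scores = [forward(s, kernel, 1.0).sum() for s in segments]
    gates = np.zeros(len(segments))
    gates[np.argsort(scores)[-n_active:]] = 1.0
    return gates
```

In a full EM-type loop, the inferred gates would fix the network configuration for each training example, after which the kernels are updated by back-propagation before gates are re-inferred.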
