3D-based Deep Convolutional Neural Network for action recognition with depth sequences

Traditional hand-crafted feature design for action recognition has been an active research area over the last decade. Compared with RGB video, depth sequences are less sensitive to lighting changes and more discriminative because they capture the geometric structure of objects. Unlike many existing methods that depend on well-designed features, this paper studies deep learning-based action recognition using depth sequences and the corresponding skeleton joint information. First, we construct a 3D-based Deep Convolutional Neural Network (3D2CNN) to learn spatio-temporal features directly from raw depth sequences; we then compute a joint-based feature vector, named JointVector, for each sequence from simple position and angle information between skeleton joints. Finally, support vector machine (SVM) classification results obtained from the 3D2CNN-learned features and from JointVector are fused to perform action recognition. Experimental results demonstrate that our method learns feature representations from depth sequences that are time-invariant and viewpoint-invariant. The proposed method achieves results comparable to state-of-the-art methods on the UTKinect-Action3D dataset and outperforms baseline methods on the MSR-Action3D dataset. We further investigate the generalization of the trained model by transferring the learned features from one dataset (MSR-Action3D) to another (UTKinect-Action3D) without retraining and obtain very promising classification accuracy.

Highlights:
- Studies machine learning-based action recognition from depth sequences.
- Directly learns spatio-temporal features without using hand-crafted features.
- Our feature representation is time-invariant and viewpoint-invariant.
- Outperforms or achieves comparable results to the state-of-the-art methods.
- The trained model and learned features can be transferred to different datasets.
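
To make the three-stage pipeline concrete, the following minimal sketch (in Python, assuming PyTorch and scikit-learn) outlines one plausible reading of the abstract: a small 3D ConvNet that maps a raw depth clip to a feature vector, a JointVector-style descriptor built from joint positions and angles, and late fusion of two SVMs' decision scores. The layer sizes, the joint pairs used for the angles, and the fusion weight are illustrative placeholders, not the configuration reported in the paper.

import numpy as np
import torch.nn as nn
from sklearn.svm import LinearSVC  # per-modality classifiers would be trained with this


class Simple3DCNN(nn.Module):
    """Toy 3D ConvNet mapping a depth clip of shape (N, 1, T, H, W) to a feature vector."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),   # global spatio-temporal pooling
        )
        self.fc = nn.Linear(32, feat_dim)

    def forward(self, clip):
        return self.fc(self.features(clip).flatten(1))


def joint_vector(joints, pairs=((0, 1), (1, 2))):
    """JointVector-style descriptor from a (T, J, 3) skeleton sequence: per-frame joint
    positions plus angles between a few joint pairs, averaged over time. The joint pairs
    here are hypothetical; the paper's exact definition is not given in the abstract."""
    positions = joints.reshape(joints.shape[0], -1)            # (T, 3*J) flattened positions
    angles = []
    for a, b in pairs:
        va, vb = joints[:, a, :], joints[:, b, :]
        cos = (va * vb).sum(-1) / (
            np.linalg.norm(va, axis=-1) * np.linalg.norm(vb, axis=-1) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.concatenate([positions.mean(0), np.stack(angles, -1).mean(0)])


def fuse_scores(svm_depth, svm_joint, x_depth, x_joint, w=0.5):
    """Late fusion: weighted sum of the two SVMs' decision scores, argmax over classes.
    Assumes a multi-class problem, so decision_function returns one score per class."""
    scores = (w * svm_depth.decision_function(x_depth)
              + (1 - w) * svm_joint.decision_function(x_joint))
    return scores.argmax(axis=1)

In this sketch, one LinearSVC would be trained on features extracted by the 3D ConvNet and another on JointVectors; fuse_scores then combines their class scores, which corresponds to the late-fusion step described above.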
