Combining CNN streams of RGB-D and skeletal data for human activity recognition

Abstract Inspired by the success of deep learning methods, for human activity recognition based on individual vision cues, this paper presents a ConvNets based approach for activity recognition by combining multiple vision cues. Moreover, a new method of creating skeleton images, from skeleton joint sequences, representing motion information is presented in this paper. Motion representation images, namely, Motion History Image (MHI), Depth Motion Maps (DMMs) and skeleton images are constructed from RGB, depth and skeletal data of RGB-D sensor. These images are then separately trained on ConvNets and respective softmax scores are fused at the decision level. The combination of these distinct vision cues, leads to complete utilization of data, available from RGB-D sensor. To evaluate the effectiveness of the proposed 5-CNNs approach, we conduct our experiments on three well known and challenging RGB-D datasets, CAD-60, SBU Kinect interaction and UTD-MHAD. Results show that the proposed approach of combining multiple cues by means of decision level fusion is competitive with other state of the art methods.

[1]  Peng Wang,et al.  Temporal Pyramid Pooling-Based Convolutional Neural Network for Action Recognition , 2015, IEEE Transactions on Circuits and Systems for Video Technology.

[2]  Pichao Wang,et al.  Mining Mid-Level Features for Action Recognition Based on Effective Skeleton Representation , 2014, 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA).

[3]  Meng Wang,et al.  A Deep Structured Model with Radius–Margin Bound for 3D Human Activity Recognition , 2015, International Journal of Computer Vision.

[4]  Guodong Guo,et al.  Evaluating spatiotemporal interest point features for depth-based action recognition , 2014, Image Vis. Comput..

[5]  Sargur N. Srihari,et al.  Decision Combination in Multiple Classifier Systems , 1994, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Jiri Matas,et al.  On Combining Classifiers , 1998, IEEE Trans. Pattern Anal. Mach. Intell..

[7]  Marco Morana,et al.  Human Activity Recognition Process Using 3-D Posture Data , 2015, IEEE Transactions on Human-Machine Systems.

[8]  Zicheng Liu,et al.  HON4D: Histogram of Oriented 4D Normals for Activity Recognition from Depth Sequences , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Nasser Kehtarnavaz,et al.  Improving Human Action Recognition Using Fusion of Depth Camera and Inertial Sensors , 2015, IEEE Transactions on Human-Machine Systems.

[10]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[11]  Yifeng He,et al.  Human action recognition via multiview discriminative analysis of canonical correlations , 2016, 2016 IEEE International Conference on Image Processing (ICIP).

[12]  Ying Wu,et al.  Mining actionlet ensemble for action recognition with depth cameras , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Evangelos Triantaphyllou,et al.  Multi-Criteria Decision Making: An Operations Research Approach , 1998 .

[14]  Chalavadi Krishna Mohan,et al.  Human action recognition in RGB-D videos using motion sequence information and deep learning , 2017, Pattern Recognit..

[15]  Jinwen Ma,et al.  DMMs-Based Multiple Features Fusion for Human Action Recognition , 2015, Int. J. Multim. Data Eng. Manag..

[16]  Xiaodong Yang,et al.  Super Normal Vector for Activity Recognition Using Depth Sequences , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Xiangtao Zheng,et al.  A discriminative representation for human action recognition , 2016, Pattern Recognit..

[18]  J.K. Aggarwal,et al.  Human activity analysis , 2011, ACM Comput. Surv..

[19]  Andrew Zisserman,et al.  Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[20]  Stefan Wermter,et al.  Self-organizing neural integration of pose-motion features for human action recognition , 2015, Front. Neurorobot..

[21]  Guodong Guo,et al.  Fusing Spatiotemporal Features and Joints for 3D Action Recognition , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition Workshops.

[22]  D. Ruta,et al.  An Overview of Classifier Fusion Methods , 2000 .

[23]  Chokri Ben Amar,et al.  Human action recognition based on multi-layer Fisher vector encoding method , 2015, Pattern Recognit. Lett..

[24]  Wolfram Burgard,et al.  Multimodal deep learning for robust RGB-D object recognition , 2015, 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

[25]  Yong Du,et al.  Skeleton based action recognition with convolutional neural network , 2015, 2015 3rd IAPR Asian Conference on Pattern Recognition (ACPR).

[26]  Yong Du,et al.  Hierarchical recurrent neural network for skeleton based action recognition , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Edwin Escobedo,et al.  A New Approach for Dynamic Gesture Recognition Using Skeleton Trajectory Representation and Histograms of Cumulative Magnitudes , 2016, 2016 29th SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI).

[28]  Nasser Kehtarnavaz,et al.  Fusion of depth, skeleton, and inertial data for human action recognition , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[29]  Hairong Qi,et al.  Spatio-temporal feature extraction and representation for RGB-D human action recognition , 2014, Pattern Recognit. Lett..

[30]  Yifeng He,et al.  Human gesture recognition via bag of angles for 3D virtual city planning in CAVE environment , 2016, 2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP).

[31]  Bart Selman,et al.  Unstructured human activity detection from RGBD images , 2011, 2012 IEEE International Conference on Robotics and Automation.

[32]  Dimitris Samaras,et al.  Two-person interaction detection using body-pose features and multiple instance learning , 2012, 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops.

[33]  Fei-Fei Li,et al.  Large-Scale Video Classification with Convolutional Neural Networks , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[34]  Evangelos Triantaphyllou,et al.  Multi-Criteria Decision Making Methods , 2000 .

[35]  Lianzheng Ge,et al.  Three-stream CNNs for action recognition , 2017, Pattern Recognit. Lett..

[36]  Ruzena Bajcsy,et al.  Berkeley MHAD: A comprehensive Multimodal Human Action Database , 2013, 2013 IEEE Workshop on Applications of Computer Vision (WACV).

[37]  Ennio Gambi,et al.  A Human Activity Recognition System Using Skeleton Data from RGBD Sensors , 2016, Comput. Intell. Neurosci..

[38]  Xiaohui Xie,et al.  Co-Occurrence Feature Learning for Skeleton Based Action Recognition Using Regularized Deep LSTM Networks , 2016, AAAI.

[39]  Nasser Kehtarnavaz,et al.  Real-time human action recognition based on depth motion maps , 2013, Journal of Real-Time Image Processing.

[40]  Rama Chellappa,et al.  Human Action Recognition by Representing 3D Skeletons as Points in a Lie Group , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[41]  Robert P. W. Duin,et al.  The combining classifier: to train or not to train? , 2002, Object recognition supported by user interaction for service robots.

[42]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[43]  Javed Imran,et al.  Human action recognition using RGB-D sensor and deep convolutional neural networks , 2016, 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI).

[44]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[45]  Gang Wang,et al.  Spatio-Temporal LSTM with Trust Gates for 3D Human Action Recognition , 2016, ECCV.

[46]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[47]  Xiaodong Yang,et al.  Recognizing actions using depth motion maps-based histograms of oriented gradients , 2012, ACM Multimedia.

[48]  Wanqing Li,et al.  Action recognition based on a bag of 3D points , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops.

[49]  Jing Zhang,et al.  Action Recognition From Depth Maps Using Deep Convolutional Neural Networks , 2016, IEEE Transactions on Human-Machine Systems.

[50]  Limin Wang,et al.  Action recognition with trajectory-pooled deep-convolutional descriptors , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51]  Nasser Kehtarnavaz,et al.  UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor , 2015, 2015 IEEE International Conference on Image Processing (ICIP).

[52]  Md. Atiqur Rahman Ahad,et al.  Motion history image: its variants and applications , 2012, Machine Vision and Applications.

[53]  Jing Zhang,et al.  ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring , 2015, ACM Multimedia.

[54]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[55]  Andrew Zisserman,et al.  Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[56]  James W. Davis,et al.  The Recognition of Human Movement Using Temporal Templates , 2001, IEEE Trans. Pattern Anal. Mach. Intell..