Learning to recognise 3D human action from a new skeleton-based representation using deep convolutional neural networks

Recognising human actions in untrimmed videos is an important and challenging task. An effective three-dimensional (3D) motion representation and a powerful learning model are two key factors influencing recognition performance. In this study, the authors introduce a new skeleton-based representation for 3D action recognition in videos. The key idea of the proposed representation is to transform the 3D joint coordinates of the human body carried in skeleton sequences into RGB images via a colour encoding process. The 3D joint coordinates are normalised and each skeleton frame is divided into five parts, within which the joints are concatenated according to the order of their physical connections. As a result, the colour-coded representation is able to represent the spatio-temporal evolution of complex 3D motions, independently of the length of each sequence. They then design and train different deep convolutional neural networks, based on the residual network architecture, on the obtained image-based representations to learn 3D motion features and classify them into action classes. Their proposed method is evaluated on two widely used action recognition benchmarks: MSR Action3D and NTU-RGB+D, a very large-scale dataset for 3D human action recognition. The experimental results demonstrate that the proposed method outperforms previous state-of-the-art approaches while requiring less computation for training and prediction.
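
The colour encoding described above can be illustrated with a minimal sketch. It is not the authors' implementation: the 20-joint layout, the joint indices assigned to each of the five parts, the global min-max normalisation, the nearest-neighbour resampling to a fixed size, and the names `PARTS`, `JOINT_ORDER` and `skeleton_to_image` are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical grouping of a 20-joint skeleton into five body parts.
# The indices are illustrative only; a real dataset defines its own layout.
PARTS = {
    "left_arm": [0, 1, 2, 3],
    "right_arm": [4, 5, 6, 7],
    "torso": [8, 9, 10, 11],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}
JOINT_ORDER = [j for part in PARTS.values() for j in part]


def skeleton_to_image(seq, size=32):
    """Encode a (frames, joints, 3) skeleton sequence as a (size, size, 3) RGB image."""
    # Reorder joints part by part so that connected joints end up in adjacent rows.
    seq = np.asarray(seq, dtype=np.float64)[:, JOINT_ORDER, :]

    # Assumed normalisation: map all coordinates of the sequence to [0, 255].
    lo, hi = seq.min(), seq.max()
    scale = hi - lo if hi > lo else 1.0
    pixels = np.round(255.0 * (seq - lo) / scale).astype(np.uint8)

    # Rows are joints, columns are frames, and (x, y, z) become (R, G, B).
    image = pixels.transpose(1, 0, 2)

    # Nearest-neighbour resampling gives a fixed size for any sequence length.
    rows = np.linspace(0, image.shape[0] - 1, size).round().astype(int)
    cols = np.linspace(0, image.shape[1] - 1, size).round().astype(int)
    return image[np.ix_(rows, cols)]
```

Under these assumptions, sequences of 45 and 120 frames both yield an image of the same shape, so a single image-classification network can consume them without padding or truncation.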
