RGB-D sensing based human action and interaction analysis: A survey

Abstract: Human activity recognition has been actively studied for the last three decades. Compared with an action performed by a single person, human interaction is more complex because it involves more subjects and the interdependence between them. Recently, motivated by the remarkable success of deep learning techniques, many learning-based feature representations have been developed for activity recognition. This paper provides a comprehensive review of human action and interaction recognition methods, covering both hand-crafted and learning-based features, with a special focus on data captured by RGB-D sensors. The review also identifies practical challenges in human activity analysis, along with promising solutions and potential future directions.
