Skeleton-based bio-inspired human activity prediction for real-time human–robot interaction

Activity prediction, which aims to infer ongoing human activities from incomplete observations, is an essential task in practical human-centered robotics applications such as security and assisted living. To address this challenging problem, we introduce a novel bio-inspired predictive orientation decomposition (BIPOD) approach to construct representations of people from 3D skeleton trajectories. BIPOD is invariant to scale and viewpoint, runs in real time on basic computer systems, and is able to recognize and predict activities in an online fashion. Our approach is inspired by biological research in human anatomy. To capture spatio-temporal information of human motions, we spatially decompose 3D human skeleton trajectories and project them onto three anatomical planes (i.e., the coronal, transverse and sagittal planes); then, we describe short-term temporal information of joint motions and encode high-order temporal dependencies. By using Extended Kalman Filters to estimate future skeleton trajectories, we endow our BIPOD representation with the critical capabilities to reduce noise in skeleton observations and to predict ongoing activities. Experiments on benchmark datasets have shown that our BIPOD representation significantly outperforms previous methods for real-time human activity classification and prediction from 3D skeleton trajectories. Empirical studies using TurtleBot2 and Baxter humanoid robots have also validated that our BIPOD method obtains promising performance, in terms of both accuracy and efficiency, making BIPOD a fast, simple, yet powerful representation for low-latency online activity prediction in human–robot interaction applications.
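
The spatial decomposition step can be illustrated with a minimal sketch. It is not the authors' implementation: the axis convention (x = left-right, y = vertical, z = front-back), the names `PLANES`, `plane_orientations` and `orientation_histogram`, and the choice of 8 histogram bins are all assumptions made for illustration. The sketch projects the frame-to-frame displacement of one joint onto the coronal, transverse and sagittal planes and summarizes the resulting motion directions as normalized histograms. The Kalman-filter prediction step is not covered here.

```python
import numpy as np

# Assumed body-centred axes (hypothetical convention, not taken from the paper):
# x = left-right, y = vertical, z = front-back.
# Each plane is given as the pair of axis indices that span it.
PLANES = {
    "coronal": (0, 1),     # x-y plane, separates front from back
    "transverse": (0, 2),  # x-z plane, separates upper from lower
    "sagittal": (2, 1),    # z-y plane, separates left from right
}


def plane_orientations(trajectory):
    """Return, per plane, the direction of each frame-to-frame displacement.

    trajectory: array-like of shape (T, 3), one joint's 3D positions over time.
    Result: dict mapping plane name to an array of T-1 angles in radians.
    A displacement with no component in a plane yields angle 0 (arctan2(0, 0)).
    """
    traj = np.asarray(trajectory, dtype=float)
    disp = np.diff(traj, axis=0)  # (T-1, 3) displacements between frames
    return {
        name: np.arctan2(disp[:, b], disp[:, a])
        for name, (a, b) in PLANES.items()
    }


def orientation_histogram(angles, bins=8):
    """Normalized histogram of angles over [-pi, pi]; bin count is an assumption."""
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi))
    total = hist.sum()
    return hist / total if total > 0 else hist.astype(float)


if __name__ == "__main__":
    # Toy trajectory: a joint rising vertically and drifting slightly forward.
    toy = np.array([[0.0, 0.0, 0.0], [0.0, 0.1, 0.02], [0.0, 0.2, 0.04]])
    for plane, angles in plane_orientations(toy).items():
        print(plane, orientation_histogram(angles))
```

Because only displacement directions are used, uniformly rescaling the skeleton leaves the angles unchanged, which is one simple way a representation of this kind can be insensitive to scale.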
