Human Action Recognition Using a Temporal Hierarchy of Covariance Descriptors on 3D Joint Locations

Human action recognition from videos is a challenging machine vision task with multiple important application domains, such as human-robot/machine interaction, interactive entertainment, multimedia information retrieval, and surveillance. In this paper, we present a novel approach to human action recognition from 3D skeleton sequences extracted from depth data. We use the covariance matrix of skeleton joint locations over time as a discriminative descriptor for a sequence. To encode the relationship between joint movement and time, we compute multiple covariance matrices over sub-sequences in a hierarchical fashion. The descriptor has a fixed length that is independent of the length of the described sequence. Our experiments show that using the covariance descriptor with an off-the-shelf classification algorithm outperforms the state of the art in action recognition on multiple datasets, captured either via a Kinect-type sensor or a sophisticated motion capture system. We also include an evaluation on a novel large dataset using our own annotation.
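
To make the descriptor concrete, the following is a minimal sketch of how such a hierarchy of covariance matrices could be computed with NumPy. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names, the `levels` parameter, the use of non-overlapping windows that halve at each level, and the choice to keep only the upper triangle of each symmetric matrix are all assumptions made here.

```python
import numpy as np


def covariance_descriptor(frames):
    """Upper triangle of the covariance of joint coordinates over time.

    frames: array of shape (T, N); each row is one frame holding the
    N = 3 * K stacked x, y, z coordinates of K joints. Needs T >= 2.
    """
    cov = np.atleast_2d(np.cov(frames, rowvar=False))  # (N, N) sample covariance
    upper = np.triu_indices(cov.shape[0])  # the matrix is symmetric
    return cov[upper]  # length N * (N + 1) / 2


def hierarchical_descriptor(frames, levels=2):
    """Concatenate covariance descriptors over a temporal hierarchy.

    Assumption: level l splits the sequence into 2**l non-overlapping
    windows of roughly equal length (level 0 is the whole sequence).
    """
    frames = np.asarray(frames, dtype=float)
    parts = []
    for level in range(levels):
        for window in np.array_split(frames, 2 ** level, axis=0):
            parts.append(covariance_descriptor(window))
    return np.concatenate(parts)


# Two sequences of different lengths, 20 joints each (N = 60 coordinates).
rng = np.random.default_rng(0)
short_seq = rng.normal(size=(40, 60))
long_seq = rng.normal(size=(95, 60))
print(hierarchical_descriptor(short_seq).shape)
print(hierarchical_descriptor(long_seq).shape)
```

Both calls yield vectors of the same length, because the size of each covariance matrix depends only on the number of joint coordinates and not on the number of frames. With 60 coordinates, each matrix contributes 60 × 61 / 2 = 1830 values, and two levels contribute three matrices in total. The resulting vector can be passed to any off-the-shelf classifier.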
