Volumetric spatial feature representation for view-invariant human action recognition using a depth camera

Abstract. Viewpoint variation is a challenging problem in vision-based human action recognition. Three-dimensional (3-D) depth cameras provide 3-D point clouds, whose richer information makes it possible to analyze spatial variations in human actions effectively. In this paper, we propose a volumetric spatial feature representation (VSFR) that measures the density of 3-D point clouds for view-invariant human action recognition from depth image sequences. Using the VSFR, we construct a self-similarity matrix (SSM) that graphically represents temporal variations in a depth sequence. Each entry of the SSM is the squared Euclidean distance between the VSFRs of a pair of frames in the video sequence, so the SSM represents the pairwise dissimilarity between frames in terms of spatial information, for a sequence captured from an arbitrary viewpoint. Furthermore, because we use a bag-of-features method for feature representation, the proposed method efficiently handles variations in action speed and length. Hence, our method is robust to variations in both viewpoint and action-sequence length. We compared the proposed method with state-of-the-art methods from the literature on three public datasets, ACT42, MSRAction3D, and MSRDailyActivity3D, on which it achieved the highest accuracies.
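The SSM construction described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the VSFR is computed, so the `voxel_density` function below is a hypothetical stand-in (a normalized point count per voxel of the cloud's bounding box) and the grid size is an assumed parameter; only the squared Euclidean distance between per-frame features is taken from the text.

```python
import numpy as np

def voxel_density(points, bins=(4, 4, 4)):
    """Hypothetical stand-in for a VSFR: fraction of points in each voxel
    of the point cloud's axis-aligned bounding box (not the paper's definition)."""
    points = np.asarray(points, dtype=float)
    lo = points.min(axis=0)
    hi = points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat axes
    grid = np.array(bins)
    # Scale each coordinate to [0, 1], then map it to an integer voxel index.
    idx = np.floor((points - lo) / span * grid).astype(int)
    idx = np.minimum(idx, grid - 1)  # points on the upper face go in the last voxel
    flat = np.ravel_multi_index(tuple(idx.T), bins)
    hist = np.bincount(flat, minlength=int(np.prod(bins))).astype(float)
    return hist / len(points)

def self_similarity_matrix(features):
    """SSM[i, j] = squared Euclidean distance between the features of frames i and j."""
    f = np.asarray(features, dtype=float)
    diff = f[:, None, :] - f[None, :, :]
    return np.einsum("ijk,ijk->ij", diff, diff)

# Synthetic data for illustration only: six frames of 500 random 3-D points each.
rng = np.random.default_rng(0)
frames = [rng.normal(size=(500, 3)) for _ in range(6)]
features = [voxel_density(p) for p in frames]
ssm = self_similarity_matrix(features)
```

By construction the resulting matrix is symmetric with a zero diagonal, since each frame is at distance zero from itself. The stand-in descriptor is not view-invariant on its own; the sketch only shows how per-frame features are turned into an SSM.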
