Spatio-temporal steerable pyramid for human action recognition

In this paper, we propose a novel holistic representation for human action recognition based on the spatio-temporal steerable pyramid (STSP). The spatio-temporal Laplacian pyramid provides an effective technique for multi-scale analysis of video sequences: by decomposing spatio-temporal volumes into band-passed sub-volumes, it localizes spatio-temporal patterns residing at different scales. Three-dimensional separable steerable filters are then applied to each sub-volume to capture spatio-temporal orientation information efficiently. The outputs of a quadrature pair of steerable filters are squared and summed to yield a more robust measure of motion energy. To make the representation invariant to shifts and applicable with coarsely extracted bounding boxes around the performed actions, max-pooling operations are employed between the filter responses at adjacent scales and over local spatio-temporal neighborhoods. By combining multi-scale, multi-orientation analysis with feature pooling, STSP produces a compact yet informative and invariant representation of human actions. Extensive experiments on the KTH, IXMAS, and HMDB51 datasets show that STSP achieves results comparable to state-of-the-art methods.
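The pipeline described above (band-pass pyramid, quadrature-pair motion energy, max pooling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are ours, the quadrature pair is approximated with first- and second-order temporal Gaussian derivatives rather than exact Hilbert pairs (e.g., the G2/H2 filters of the steerable-filter literature), and a full STSP would steer the filters over multiple 3D orientations.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter

def laplacian_pyramid_3d(volume, levels=3, sigma=1.0):
    """Spatio-temporal Laplacian pyramid: each level keeps the band-pass
    residual between the volume and its Gaussian-blurred version, then
    downsamples the low-pass by 2 in all three dimensions (t, y, x)."""
    pyramid = []
    current = volume.astype(np.float64)
    for _ in range(levels):
        low = gaussian_filter(current, sigma)
        pyramid.append(current - low)          # band-passed sub-volume
        current = low[::2, ::2, ::2]           # downsample t, y, x
    pyramid.append(current)                    # low-pass residual
    return pyramid

def quadrature_energy(volume, sigma=1.0):
    """Phase-insensitive motion energy: square and sum the responses of an
    even (2nd-derivative) and an odd (1st-derivative) Gaussian filter along
    the temporal axis. (An approximation to a true quadrature pair.)"""
    even = gaussian_filter(volume, sigma, order=(2, 0, 0))  # even along t
    odd = gaussian_filter(volume, sigma, order=(1, 0, 0))   # odd along t
    return even ** 2 + odd ** 2

def stsp_features(volume, levels=3, pool=4):
    """Compute motion energy at each pyramid scale, then max-pool over
    local spatio-temporal neighborhoods for shift invariance."""
    feats = []
    for band in laplacian_pyramid_3d(volume, levels):
        energy = quadrature_energy(band)
        pooled = maximum_filter(energy, size=pool)[::pool, ::pool, ::pool]
        feats.append(pooled.ravel())
    return np.concatenate(feats)
```

The resulting vector can be fed to a standard classifier (the paper uses an SVM); max pooling here serves the same role as in the paper, trading exact localization for robustness to shifts of the action within the bounding box.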
