Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis

Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previous published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8% respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, will also be discussed. You can download our code and learned spatio-temporal features here: http://ai.stanford.edu/∼wzou/

[1]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[2]  J. V. van Hateren,et al.  Independent component filters of natural images compared with simple cells in primary visual cortex , 1998, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[3]  Yoshua Bengio,et al.  Convolutional networks for images, speech, and time series , 1998 .

[4]  D. Ruderman,et al.  Independent component analysis of natural image sequences yields spatio-temporal filters similar to simple cells in primary visual cortex , 1998, Proceedings of the Royal Society of London. Series B: Biological Sciences.

[5]  David G. Lowe,et al.  Object recognition from local scale-invariant features , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[6]  Bruno A. Olshausen,et al.  Sparse Coding Of Time-Varying Natural Images , 2010 .

[7]  Aapo Hyvärinen,et al.  Emergence of Phase- and Shift-Invariant Features by Decomposition of Natural Images into Independent Feature Subspaces , 2000, Neural Computation.

[8]  Aapo Hyvärinen,et al.  Topographic independent component analysis as a model of V1 organization and receptive fields , 2001, Neurocomputing.

[9]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[10]  G LoweDavid,et al.  Distinctive Image Features from Scale-Invariant Keypoints , 2004 .

[11]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[12]  Y. LeCun,et al.  Learning methods for generic object recognition with invariance to pose and lighting , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[13]  Laurenz Wiskott,et al.  Slow feature analysis yields a rich repertoire of complex cell properties. , 2005, Journal of vision.

[14]  I. Patras,et al.  Spatiotemporal salient points for visual recognition of human actions , 2005, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[15]  Cordelia Schmid,et al.  A Comparison of Affine Region Detectors , 2005, International Journal of Computer Vision.

[16]  Cordelia Schmid,et al.  A performance evaluation of local descriptors , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[17]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[18]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[19]  Mubarak Shah,et al.  Recognizing human actions , 2005, VSSN@MM.

[20]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[21]  Juan Carlos Niebles,et al.  Unsupervised Learning of Human Action Categories Using Spatial-Temporal Words , 2006, BMVC.

[22]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[23]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[24]  Thomas Hofmann,et al.  Greedy Layer-Wise Training of Deep Networks , 2007 .

[25]  Rajat Raina,et al.  Self-taught learning: transfer learning from unlabeled data , 2007, ICML '07.

[26]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[27]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[28]  Mubarak Shah,et al.  Action MACH a spatio-temporal Maximum Average Correlation Height filter for action recognition , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[29]  Cordelia Schmid,et al.  A Spatio-Temporal Descriptor Based on 3D-Gradients , 2008, BMVC.

[30]  Luc Van Gool,et al.  Speeded-Up Robust Features (SURF) , 2008, Comput. Vis. Image Underst..

[31]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[32]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[33]  Quoc V. Le,et al.  Measuring Invariances in Deep Networks , 2009, NIPS.

[34]  Peyman Milanfar,et al.  Detection of human actions from a single example , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[35]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[36]  Christopher Hunt,et al.  Notes on the OpenSURF Library , 2009 .

[37]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[38]  Jiebo Luo,et al.  Recognizing realistic actions from videos “in the wild” , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[39]  Cordelia Schmid,et al.  Actions in context , 2009, CVPR.

[40]  Le Li,et al.  SENSC: a Stable and Efficient Algorithm for Nonnegative Sparse Coding: SENSC: a Stable and Efficient Algorithm for Nonnegative Sparse Coding , 2009 .

[41]  Rajat Raina,et al.  Large-scale deep unsupervised learning using graphics processors , 2009, ICML '09.

[42]  Quoc V. Le,et al.  Tiled convolutional neural networks , 2010, NIPS.

[43]  Garrison W. Cottrell,et al.  Robust classification of objects, faces, and flowers using natural image statistics , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[44]  David Elliott,et al.  In the Wild , 2010 .

[45]  Yann LeCun,et al.  Convolutional Learning of Spatio-temporal Features , 2010, ECCV.

[46]  Ming Yang,et al.  3D Convolutional Neural Networks for Human Action Recognition , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.