Convolutional Learning of Spatio-temporal Features

We address the problem of learning good features for understanding video data. We introduce a model that learns latent representations of image sequences from pairs of successive images. The convolutional architecture of our model allows it to scale to realistic image sizes whilst using a compact parametrization. In experiments on the NORB dataset, we show our model extracts latent "flow fields" which correspond to the transformation between the pair of input frames. We also use our model to extract low-level motion features in a multi-stage architecture for action recognition, demonstrating competitive performance on both the KTH and Hollywood2 datasets.

[1]  David Haussler,et al.  Unsupervised learning of distributions on binary vectors using two layer networks , 1991, NIPS 1991.

[2]  David J. Field,et al.  Sparse coding with an overcomplete basis set: A strategy employed by V1? , 1997, Vision Research.

[3]  Yoshua Bengio,et al.  Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[4]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[5]  Ivan Laptev,et al.  On Space-Time Interest Points , 2003, Proceedings Ninth IEEE International Conference on Computer Vision.

[6]  Miguel Á. Carreira-Perpiñán,et al.  Multiscale conditional random fields for image labeling , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[7]  Barbara Caputo,et al.  Recognizing human actions: a local SVM approach , 2004, Proceedings of the 17th International Conference on Pattern Recognition, 2004. ICPR 2004..

[8]  Y. LeCun,et al.  Learning methods for generic object recognition with invariance to pose and lighting , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[9]  Serge J. Belongie,et al.  Behavior recognition via sparse spatio-temporal features , 2005, 2005 IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance.

[10]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[11]  Marc'Aurelio Ranzato,et al.  Efficient Learning of Sparse Representations with an Energy-Based Model , 2006, NIPS.

[12]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[13]  Thomas Serre,et al.  A Biologically Inspired System for Action Recognition , 2007, 2007 IEEE 11th International Conference on Computer Vision.

[14]  Honglak Lee,et al.  Sparse deep belief net model for visual area V2 , 2007, NIPS.

[15]  Sang Uk Lee,et al.  Three-dimensional oil painting reconstruction with stroke based rendering , 2007, The Visual Computer.

[16]  Geoffrey E. Hinton,et al.  Unsupervised Learning of Image Transformations , 2007, 2007 IEEE Conference on Computer Vision and Pattern Recognition.

[17]  Yoshua Bengio,et al.  An empirical evaluation of deep architectures on problems with many factors of variation , 2007, ICML '07.

[18]  Geoffrey E. Hinton,et al.  Learning Multilevel Distributed Representations for High-Dimensional Sequences , 2007, AISTATS.

[19]  Andrew J. Davison,et al.  Active Matching , 2008, ECCV.

[20]  Cordelia Schmid,et al.  Learning realistic human actions from movies , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[21]  Luc Van Gool,et al.  An Efficient Dense and Scale-Invariant Spatio-Temporal Interest Point Detector , 2008, ECCV.

[22]  Bruno A. Olshausen,et al.  Learning Transformational Invariants from Natural Movies , 2008, NIPS.

[23]  Nicolas Pinto,et al.  Why is Real-World Visual Object Recognition Hard? , 2008, PLoS Comput. Biol..

[24]  Mohammad Norouzi,et al.  Stacks of convolutional Restricted Boltzmann Machines for shift-invariant feature learning , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[25]  Honglak Lee,et al.  Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations , 2009, ICML '09.

[26]  Yann LeCun,et al.  What is the best multi-stage architecture for object recognition? , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[27]  Cordelia Schmid,et al.  Actions in context , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[28]  Geoffrey E. Hinton,et al.  3D Object Recognition with Deep Belief Nets , 2009, NIPS.

[29]  Thomas L. Dean,et al.  Recursive Sparse, Spatiotemporal Coding , 2009, 2009 11th IEEE International Symposium on Multimedia.

[30]  Cordelia Schmid,et al.  Evaluation of Local Spatio-temporal Features for Action Recognition , 2009, BMVC.

[31]  Guillermo Sapiro,et al.  Online dictionary learning for sparse coding , 2009, ICML '09.

[32]  Geoffrey E. Hinton,et al.  Learning to Represent Spatial Transformations with Factored Higher-Order Boltzmann Machines , 2010, Neural Computation.