On spatio-temporal sparse coding: Analysis and an algorithm

Sparse coding is a common approach to learning local features for object recognition. Recently, there has been increasing interest in learning features from spatio-temporal, binocular, or other multi-observation data, where the goal is to encode the relationship between images rather than the content of a single image. We discuss the role of multiplicative interactions and of squaring non-linearities in learning such relations. In particular, we show that training a sparse coding model whose filter responses are squared amounts to jointly diagonalizing a set of image transformations, and that inference amounts to detecting rotations in the shared eigenspaces. Our analysis helps explain recent experimental results showing that Fourier features and circular Fourier features emerge when complex cell models are trained on translating or rotating images. It also suggests that including either squaring non-linearities or cross-products in deep learning architectures will be crucial for extending their applicability beyond simple tasks such as recognizing objects in a single image.
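The joint-diagonalization and rotation claims can be checked on the simplest transformation family. The following is a minimal sketch, assuming 1-D signals under cyclic translation. It is an illustration, not the paper's algorithm: the shift matrices are circulant and are all diagonalized by the same discrete Fourier transform. Each frequency spans a 2-D invariant subspace in which a translation acts as a rotation. A fixed cosine/sine filter pair (a hypothetical stand-in for learned filters) therefore recovers the shift from cross-products of its responses to the two images. All function names (`shift_matrix`, `quadrature_filters`, `joint_diagonal_check`, `estimate_shift`, `energy_decomposition`) are hypothetical.

```python
import numpy as np


def shift_matrix(n: int, s: int) -> np.ndarray:
    """Permutation matrix P_s with P_s @ x == np.roll(x, s): a cyclic 1-D translation."""
    return np.roll(np.eye(n), s, axis=0)


def quadrature_filters(n: int, k: int = 1):
    """Real cosine/sine filter pair for frequency k (hypothetical stand-in for learned filters)."""
    m = np.arange(n)
    return np.cos(2 * np.pi * k * m / n), np.sin(2 * np.pi * k * m / n)


def joint_diagonal_check(n: int) -> bool:
    """True if the DFT matrix diagonalizes every cyclic shift P_0, ..., P_{n-1} at once."""
    F = np.fft.fft(np.eye(n))      # F[k, m] = exp(-2j*pi*k*m/n)
    F_inv = F.conj().T / n         # F @ F^H = n * I, so this is the exact inverse
    for s in range(n):
        D = F @ shift_matrix(n, s) @ F_inv
        if not np.allclose(D - np.diag(np.diag(D)), 0):
            return False
    return True


def estimate_shift(x: np.ndarray, y: np.ndarray) -> int:
    """Recover s in y = np.roll(x, s) from products of filter responses at k = 1."""
    n = len(x)
    c, sn = quadrature_filters(n, k=1)   # k = 1 makes the phase unambiguous mod n
    ax, bx = c @ x, sn @ x               # responses to the first image
    ay, by = c @ y, sn @ y               # responses to the second image
    # With z = a - 1j*b (the k-th DFT coefficient), z_y * conj(z_x) has
    # real part ax*ay + bx*by and imaginary part ay*bx - ax*by.
    re = ax * ay + bx * by
    im = ay * bx - ax * by
    angle = np.arctan2(im, re)           # rotation angle, equal to -2*pi*s/n mod 2*pi
    return int(round(-angle * n / (2 * np.pi))) % n


def energy_decomposition(x: np.ndarray, y: np.ndarray, k: int = 1):
    """Return the energy of summed responses and its expansion into squares + cross terms."""
    c, sn = quadrature_filters(len(x), k)
    ax, bx, ay, by = c @ x, sn @ x, c @ y, sn @ y
    energy = (ax + ay) ** 2 + (bx + by) ** 2
    expanded = (ax**2 + bx**2) + (ay**2 + by**2) + 2 * (ax * ay + bx * by)
    return energy, expanded


n = 16
x = np.arange(n, dtype=float)            # any signal with a nonzero k = 1 coefficient
y = shift_matrix(n, 5) @ x
print(joint_diagonal_check(n))           # True
print(estimate_shift(x, y))              # 5
```

`energy_decomposition` illustrates why squaring can stand in for explicit multiplicative interactions. Squaring the summed responses of a filter pair to both images yields the per-image squared responses plus twice the same cross-product term (`ax*ay + bx*by`) that `estimate_shift` uses.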
