Learning Intermediate-Level Representations of Form and Motion from Natural Movies

Abstract

We present a model of intermediate-level visual representation based on learning invariances from movies of the natural environment. The model has two stages of processing: an early feature-representation layer and a second layer in which invariances are represented explicitly. Invariances are learned by factoring apart the temporally stable and dynamic components embedded in the early feature representation. The structure contained in these components is made explicit in the activities of second-layer units that capture invariances in both form and motion. When trained on natural movies, the first layer produces a factorization, or separation, of image content into a temporally persistent part representing local edge structure and a dynamic part representing local motion structure, consistent with known response properties in early visual cortex (area V1). This factorization linearizes statistical dependencies among the first-layer units, making them learnable by the second layer. The second-layer units are split into two populations according to the factorization in the first layer. The form-selective units receive their input from the temporally persistent part (local edge structure) and, after training, come to represent a diverse set of higher-order shape features consisting of extended contours, multiscale edges, textures, and texture boundaries. The motion-selective units receive their input from the dynamic part (local motion structure) and, after training, come to represent image translation over different spatial scales and directions, as well as more complex deformations. These representations provide a rich description of dynamic natural images and yield testable hypotheses about intermediate-level representation in visual cortex.
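The abstract does not say how the first layer separates a temporally persistent part from a dynamic part. One common way to obtain such a split, and an assumption here, is to use complex-valued (quadrature-pair) filters: the amplitude of each coefficient tracks local structure and stays stable under small translations, while the frame-to-frame change in phase tracks local motion. The Python sketch below illustrates this on a toy 1D translating grating. The function names, the fixed (not learned) Gabor filter, its parameters, and the stimulus are all hypothetical illustrative choices, not the paper's implementation, and second-layer learning is not shown.

```python
import cmath
import math


def complex_gabor_response(signal, center, freq, sigma):
    """Hypothetical first-layer unit: project a 1D signal onto a complex Gabor.

    The real and imaginary parts act as an even/odd (quadrature) filter pair.
    This fixed filter is an assumption for illustration; it is not learned.
    """
    z = 0j
    for x, value in enumerate(signal):
        envelope = math.exp(-((x - center) ** 2) / (2 * sigma ** 2))
        z += value * envelope * cmath.exp(-1j * freq * (x - center))
    return z


def translating_grating(n, freq, contrast, shift):
    """Toy stimulus: a 1D sinusoidal grating translated by `shift` pixels."""
    return [contrast * math.cos(freq * (x - shift)) for x in range(n)]


def wrap(angle):
    """Wrap an angle into [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def factorize(responses):
    """Split a time series of complex responses into two streams.

    form:   log-amplitude per frame (expected to be temporally persistent)
    motion: wrapped phase change between consecutive frames (dynamic)
    """
    form = [math.log(abs(z)) for z in responses]
    motion = [wrap(cmath.phase(b) - cmath.phase(a))
              for a, b in zip(responses, responses[1:])]
    return form, motion


if __name__ == "__main__":
    n, freq, sigma, center = 64, 2 * math.pi / 8, 4.0, 32
    velocity = 0.5  # pixels per frame (illustrative value)
    frames = [translating_grating(n, freq, 1.0, velocity * t) for t in range(5)]
    form, motion = factorize(
        [complex_gabor_response(f, center, freq, sigma) for f in frames])
    # With this filter convention the phase decreases by freq * velocity per
    # frame, so velocity is recovered as -phase_change / freq.
    print("log-amplitude:", [round(a, 4) for a in form])  # nearly constant
    print("estimated velocity:", [round(-d / freq, 4) for d in motion])  # ~0.5
```

Because the projection is linear, doubling the grating's contrast shifts every log-amplitude by log 2 while leaving the phase changes untouched. In this toy case the two streams vary independently, which is the kind of separation the form-selective and motion-selective second-layer populations are described as receiving.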