The computational magic of the ventral stream: sketch of a theory (and why some deep architectures work).

This paper explores the theoretical consequences of a simple assumption: the computational goal of the feedforward path in the ventral stream – from V1 through V2 and V4 to IT – is to discount image transformations, after learning them during development. Part I assumes that a basic neural operation consists of dot products between input vectors and synaptic weights, which can be modified by learning. It proves that a multi-layer hierarchical architecture of dot-product modules can learn geometric transformations of images in an unsupervised way and then achieve the dual goals of invariance to global affine transformations and of robustness to diffeomorphisms. These architectures learn in an unsupervised way to be automatically invariant to transformations of a new object, achieving the goal of recognition with one or very few labeled examples. The theory of Part I should apply to a varying degree to a range of hierarchical architectures such as HMAX, convolutional networks and related feedforward models of the visual system, and should formally characterize some of their properties. A linking conjecture in Part II assumes that storage of transformed templates during development – a stage implied by the theory of Part I – takes place via Hebbian-like developmental learning at the synapses in visual cortex. It follows that the cells' tuning will effectively converge during development to the top eigenvectors of the covariance of their inputs. The solution of the associated eigenvalue problem is surprisingly tolerant of details of the image spectrum. It predicts quantitative properties of the tuning of cells in the first layer, identified with simple cells in V1; in particular, they should converge during development to oriented Gabor-like wavelets with frequency inversely proportional to the size of an elliptic Gaussian envelope, in agreement with data from the cat, the macaque and the mouse.
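The convergence claim for Hebbian-like learning can be illustrated with Oja's rule, a standard normalized Hebbian update whose weight vector tends toward the top eigenvector of the input covariance. The following is a minimal sketch rather than the model analyzed in the paper: the two-dimensional Gaussian inputs, the learning rate, and the function names `make_samples` and `oja_top_eigenvector` are assumptions chosen for illustration.

```python
import math
import random


def make_samples(n=2000, seed=1):
    """Zero-mean 2D Gaussian samples (a toy stand-in for a cell's inputs).

    The direction of largest variance is rotated 30 degrees from the first axis.
    """
    rng = random.Random(seed)
    c, s = math.cos(math.pi / 6), math.sin(math.pi / 6)
    samples = []
    for _ in range(n):
        a = rng.gauss(0.0, 2.0)   # standard deviation 2 along the principal direction
        b = rng.gauss(0.0, 0.5)   # standard deviation 0.5 along the orthogonal one
        samples.append((a * c - b * s, a * s + b * c))
    return samples


def oja_top_eigenvector(samples, lr=0.005, epochs=5, seed=0):
    """Apply Oja's rule, w <- w + lr * y * (x - y * w), with y = <w, x>."""
    rng = random.Random(seed)
    dim = len(samples[0])
    w = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]     # start from a random unit vector
    for _ in range(epochs):
        for x in samples:
            # Dot product between input and synaptic weights: the cell's response.
            y = sum(wi * xi for wi, xi in zip(w, x))
            # Hebbian term y * x plus a decay term -y^2 * w that keeps |w| near 1.
            w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


if __name__ == "__main__":
    weights = oja_top_eigenvector(make_samples())
    print("learned weights:", weights)
```

On this toy input, the learned weights should align, up to sign, with the direction of largest variance, here 30 degrees from the first axis.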
A similar analysis leads to predictions about receptive field tuning in higher visual areas – such as V2 and V4 – and in particular about the size of simple and complex receptive fields in each of the areas. For non-affine transformations of the image – for instance those induced by out-of-plane rotations of a 3D object or by non-rigid deformations – it is possible to prove that the dot-product technique of Part I can provide approximate invariance for certain classes of objects. Thus Part III considers modules that are class-specific – such as the face, the word and the body area – and predicts several properties of the macaque cortex face patches characterized by Freiwald and Tsao, including a patch (called AL) which contains mirror-symmetric cells and is the input to the pose-invariant patch (AM). Taken together, the results of the papers suggest a computational role for the ventral stream and derive detailed properties of the architecture and of the tuning of cells, including the role and quantitative properties of neurons in V1. A surprising implication of these theoretical results is that the computational goals and several of the tuning properties of cells in the ventral stream may follow from symmetry properties (in the sense of physics) of the visual world through a process of unsupervised correlational learning, based on Hebbian synapses.
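The dot-product technique of Part I can be sketched in the simplest exact setting, where the transformations form a finite group. The example below assumes one-dimensional signals and cyclic shifts as the group, stores every shift of each template (the "transformed templates"), and pools the dot products with a max and a mean energy. These choices, and the names `cyclic_shifts`, `dot` and `signature`, are illustrative assumptions and not the paper's implementation. Because shifting the image only permutes the set of dot products, any pooled statistic is unchanged.

```python
def cyclic_shifts(v):
    """Return every cyclic shift of the list v (the orbit of v under the shift group)."""
    return [v[k:] + v[:k] for k in range(len(v))]


def dot(a, b):
    """Dot product between an input vector and a stored template."""
    return sum(x * y for x, y in zip(a, b))


def signature(image, template_orbits):
    """Pool dot products of the image with all stored transformed templates.

    For each template orbit, return (max response, mean squared response).
    Both pooled values are invariant to cyclic shifts of the image.
    """
    sig = []
    for orbit in template_orbits:
        responses = [dot(image, t) for t in orbit]
        energy = sum(r * r for r in responses) / len(responses)
        sig.append((max(responses), energy))
    return sig


if __name__ == "__main__":
    # Hypothetical templates, stored together with all their shifts.
    templates = [[1, 0, 0, 0, 0, 0], [1, 2, 0, -1, 0, 0]]
    orbits = [cyclic_shifts(t) for t in templates]

    image = [3, 1, 0, 2, 0, 1]
    shifted = image[2:] + image[:2]   # the same "object", translated

    print(signature(image, orbits))
    print(signature(shifted, orbits))
```

The signature is computed for a new image without ever storing its own shifts, which is the sense in which invariance learned from templates transfers to a new object in this toy setting.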
