THE COMPUTATIONAL MAGIC OF THE VENTRAL STREAM

I argue that the sample complexity of (biological, feedforward) object recognition is mostly due to geometric image transformations, and conjecture that a main goal of the ventral stream – V1, V2, V4 and IT – is to learn-and-discount image transformations. In the first part of the paper I describe a class of simple and biologically plausible memory-based modules that learn transformations from unsupervised visual experience. The main theorems show that these modules provide (for every object) a signature which is invariant to local affine transformations and approximately invariant to other transformations. I also prove that, in a broad class of hierarchical architectures, signatures remain invariant from layer to layer. The identification of these memory-based modules with complex (and simple) cells in visual areas leads to a theory of invariant recognition for the ventral stream.

In the second part, I outline a theory of hierarchical architectures that can learn invariance to transformations. I show that the memory complexity of learning affine transformations is drastically reduced in a hierarchical architecture that factorizes transformations in terms of the subgroup of translations and the subgroups of rotations and scalings. I then show how translations are automatically selected as the only learnable transformations during development by enforcing small apertures – e.g., small receptive fields – in the first layer.

In the third part I show that the transformations represented in each area can be optimized in terms of storage and robustness, thereby determining the tuning of the neurons in the area rather independently (under normal conditions) of the statistics of natural images. I describe a model of learning that can be proved to have this property, linking in an elegant way the spectral properties of the signatures with the tuning of receptive fields in different areas.

A surprising implication of these theoretical results is that the computational goals and some of the tuning properties of cells in the ventral stream may follow from symmetry properties (in the sense of physics) of the visual world, through a process of unsupervised correlational learning based on Hebbian synapses. In particular, simple and complex cells do not directly care about oriented bars: their tuning is a side effect of their role in translation invariance. Across the whole ventral stream, the preferred features reported for neurons in different areas are only a symptom of the invariances computed and represented.

The results of each of the three parts stand on their own, independently of each other. Together, this theory-in-fieri makes several broad predictions, some of which are:

• invariance to small transformations in early areas (e.g., translations in V1) may underlie the stability of visual perception (suggested by Stu Geman);
• each cell's tuning properties are shaped by visual experience of image transformations during developmental and adult plasticity;
• simple cells are likely to be the same population as complex cells, arising from different convergence of the Hebbian learning rule; the inputs to complex cells are dendritic branches with simple cell properties;
• class-specific transformations are learned and represented at the top of the ventral stream hierarchy; thus class-specific modules – such as face, place and possibly body areas – should exist in IT;
• the type of transformations that are learned from visual experience depends on the size of the receptive fields, and thus on the area (layer in the models) – assuming that the size increases with layers;
• the mix of transformations learned in each area influences the tuning properties of the cells – oriented bars in V1 and V2, radial and spiral patterns in V4, up to class-specific tuning in AIT (e.g., face-tuned cells);
• features must be discriminative and invariant: invariance to transformations is the primary determinant of the tuning of cortical neurons, rather than the statistics of natural images.

The theory is broadly consistent with the current version of HMAX; it explains HMAX and extends it in terms of unsupervised learning, a broader class of transformation invariances and higher-level modules. The goal of this paper is to sketch a comprehensive theory with little regard for mathematical niceties. If the theory turns out to be useful there will be scope for deep mathematics, ranging from group representation tools to wavelet theory to the dynamics of learning. The three sketches below illustrate, in toy form, the invariant-signature computation of the first part, the aperture argument of the second, and the Hebbian learning step of the third.
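As a toy illustration of the memory-based module of the first part: the module memorizes frames of a template under a set of transformations (here the full group of circular translations); "simple cells" compute normalized dot products with the stored frames, and a "complex cell" pools them. Pooling over the whole group makes the resulting signature exactly invariant. This is a minimal sketch in plain numpy; the names (translate, book, signature) are mine, not the paper's.

    import numpy as np

    def translate(image, dx, dy):
        # Circular shift; stands in for one stored geometric transformation.
        return np.roll(np.roll(image, dy, axis=0), dx, axis=1)

    rng = np.random.default_rng(0)
    template = rng.standard_normal((16, 16))

    # "Templatebook": frames of the template under every circular translation,
    # as if memorized from unsupervised experience of the template moving.
    book = [translate(template, dx, dy) for dx in range(16) for dy in range(16)]

    def signature(image):
        # Simple-cell stage: normalized dot products with the stored frames.
        dots = np.array([np.vdot(f, image) for f in book]) / np.linalg.norm(image)
        # Complex-cell stage: pooling (one moment of the distribution of dots);
        # the value cannot depend on which translation was applied to the input.
        return np.mean(dots ** 2)

    img = rng.standard_normal((16, 16))
    print(signature(img) - signature(translate(img, 5, -3)))  # ~0: invariant

Exact invariance here relies on pooling over a complete group orbit; the theorems in the paper address the realistic case of partial orbits, where the signature is locally and approximately invariant.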
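A back-of-the-envelope check of the aperture argument of the second part: restricted to a small receptive field, a rotation is indistinguishable from a translation up to a residual proportional to the aperture radius, which is one way to see why small first-layer apertures select translations as the only learnable transformations. The numbers are illustrative only.

    import numpy as np

    theta = np.deg2rad(5.0)
    Rm = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
    c = np.array([100.0, 0.0])   # aperture center, away from the rotation center
    t = Rm @ c - c               # the translation that best mimics the rotation here

    ang = np.linspace(0.0, 2.0 * np.pi, 64, endpoint=False)
    for r in (16.0, 4.0, 1.0):   # shrinking aperture (receptive-field) radius
        rim = c + r * np.stack([np.cos(ang), np.sin(ang)], axis=1)
        # Worst-case gap, over the aperture rim, between the true rotation and
        # the pure translation by t: it equals 2*sin(theta/2)*r.
        residual = np.max(np.linalg.norm(rim @ Rm.T - (rim + t), axis=1))
        print(r, residual)       # residual shrinks linearly with the aperture

Within an aperture of radius r the rotation and the translation differ by at most 2 sin(theta/2) * r, so as receptive fields shrink, translation is effectively the only transformation visible to, and hence learnable by, the first layer.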
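For the third part, a minimal sketch of the unsupervised correlational learning step: Oja's Hebbian rule, exposed to a pattern under random translations, converges into the principal eigenspace of the experienced correlation matrix. Because that matrix is circulant, its eigenvectors are Fourier modes, so the learned weight vector is oscillatory – tuning determined by the transformation group rather than by the detailed statistics of the pattern. Plain numpy; the parameter values are arbitrary choices of mine.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 32
    template = rng.standard_normal(n)

    w = rng.standard_normal(n)
    w /= np.linalg.norm(w)
    eta = 0.005
    for _ in range(50000):
        x = np.roll(template, rng.integers(n))   # the pattern under a random shift
        y = w @ x
        w += eta * y * (x - y * w)               # Oja's rule: Hebb + normalization

    # The correlation matrix of all shifts is circulant: its eigenvectors are
    # Fourier modes, its eigenvalues proportional to the template's power spectrum.
    C = np.mean([np.outer(np.roll(template, k), np.roll(template, k))
                 for k in range(n)], axis=0)
    eigvals, eigvecs = np.linalg.eigh(C)
    # w should lie in the top eigenspace (a cosine/sine pair at one frequency).
    print(np.linalg.norm(eigvecs[:, -2:].T @ w))  # ~1: in the top eigenspace
    print(w @ C @ w, eigvals[-1])                 # Rayleigh quotient ~ top eigenvalue

In the paper, it is the combination of this spectral property with localized apertures that is argued to yield oriented, Gabor-like receptive fields in early areas.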
