The human visual system and CNNs can both support robust online translation tolerance following extreme displacements.

Visual translation tolerance refers to our capacity to recognize objects over a wide range of different retinal locations. Although translation is perhaps the simplest spatial transform that the visual system needs to cope with, the extent to which the human visual system can identify objects at previously unseen locations is unclear, with some studies reporting near-complete invariance over 10° and others reporting zero invariance at 4° of visual angle. Similarly, there is confusion regarding the extent of translation tolerance in computational models of vision, as well as the degree of match between human and model performance. Here we report a series of eye-tracking studies (total N = 70) demonstrating that novel objects trained at one retinal location can be recognized at high accuracy rates following translations of up to 18°. We also show that standard deep convolutional neural networks (DCNNs) support our findings when pretrained to classify another set of stimuli across a range of locations, or when a Global Average Pooling (GAP) layer is added to produce larger receptive fields. Our findings provide a strong constraint for theories of human vision and help explain inconsistent findings previously reported with CNNs.
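The role of the GAP layer can be illustrated with a toy example. The sketch below is not the networks or stimuli used in the studies: it is a single convolution-plus-ReLU stage in NumPy, and every name in it (`conv2d_valid`, `gap_features`, `flattened_features`, `place`) is hypothetical. It assumes a small object on an empty background that stays well inside the image after translation. Under those assumptions, averaging each feature map over all positions yields the same feature vector wherever the object appears, whereas the flattened feature maps that a fully connected readout would see change with location.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' cross-correlation of a 2-D image with a 2-D kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def gap_features(image, kernels):
    """Conv -> ReLU -> global average pooling: one value per kernel."""
    return np.array([np.maximum(conv2d_valid(image, k), 0.0).mean()
                     for k in kernels])

def flattened_features(image, kernels):
    """Conv -> ReLU -> flatten: position information is kept."""
    return np.concatenate([np.maximum(conv2d_valid(image, k), 0.0).ravel()
                           for k in kernels])

def place(obj, canvas_size, top, left):
    """Put the object on an empty (all-zero) square canvas."""
    canvas = np.zeros((canvas_size, canvas_size))
    h, w = obj.shape
    canvas[top:top + h, left:left + w] = obj
    return canvas

rng = np.random.default_rng(0)
obj = rng.random((5, 5))                      # toy "novel object"
kernels = np.concatenate([rng.standard_normal((3, 3, 3)),
                          np.ones((1, 3, 3))])  # last kernel always responds

img_a = place(obj, 40, top=4, left=4)         # "trained" location
img_b = place(obj, 40, top=28, left=30)       # large displacement

gap_a, gap_b = gap_features(img_a, kernels), gap_features(img_b, kernels)
flat_a = flattened_features(img_a, kernels)
flat_b = flattened_features(img_b, kernels)

gap_invariant = np.allclose(gap_a, gap_b)     # pooled features match
flat_invariant = np.allclose(flat_a, flat_b)  # flattened features do not
print("GAP features identical:", gap_invariant)
print("Flattened features identical:", flat_invariant)
```

The invariance in this sketch is exact only because the background is empty and the object never touches the image border; with padding effects, cluttered backgrounds, or objects clipped at the edge, pooled features would be only approximately stable.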
