Crowding reveals fundamental differences in local vs. global processing in humans and machines

Feedforward Convolutional Neural Networks (ffCNNs) have become state-of-the-art models in both computer vision and neuroscience. However, human-like performance of ffCNNs does not necessarily imply human-like computations. Previous studies have suggested that current ffCNNs do not make use of global shape information. It is currently unclear whether this reflects fundamental differences between ffCNN and human processing or is merely an artefact of how ffCNNs are trained. Here, we use visual crowding as a well-controlled, specific probe to test global shape computations. Our results provide evidence that ffCNNs cannot produce human-like global shape computations for principled architectural reasons. We lay out approaches that may address shortcomings of ffCNNs to provide better models of the human visual system.
