Understanding the Computational Demands Underlying Visual Reasoning

Visual understanding requires comprehending complex visual relations between objects within a scene. Here, we seek to characterize the computational demands of abstract visual reasoning. We do this by systematically assessing the ability of modern deep convolutional neural networks (CNNs) to learn to solve the “Synthetic Visual Reasoning Test” (SVRT) challenge, a collection of twenty-three visual reasoning problems. Our analysis reveals a novel taxonomy of visual reasoning tasks, which is primarily explained by the type of relations (same-different vs. spatial-relation judgments) and by the number of relations used to compose the underlying rules. Prior cognitive neuroscience work suggests that attention plays a key role in humans’ visual reasoning ability. To test this hypothesis, we extended the CNNs with spatial and feature-based attention mechanisms. In a second series of experiments, we evaluated the ability of these attention networks to learn to solve the SVRT challenge and found the resulting architectures to be much more efficient at solving the hardest of these visual reasoning tasks. Most importantly, the corresponding improvements on individual tasks partially explained our novel taxonomy. Overall, this work provides a granular computational account of visual reasoning and yields testable neuroscience predictions regarding the differential need for feature-based vs. spatial attention, depending on the type of visual reasoning problem.
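
The abstract does not specify the attention modules used, but a minimal PyTorch sketch of the two mechanisms it names, feature-based (channel-wise) attention and spatial attention, in the spirit of Squeeze-and-Excitation and CBAM, may help make the idea concrete. The module names, reduction ratio, and kernel size below are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class FeatureAttention(nn.Module):
    """Feature-based (channel-wise) attention: reweights each feature map
    by a gate computed from global context (Squeeze-and-Excitation style)."""
    def __init__(self, channels, reduction=8):  # reduction ratio is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # (B, C, H, W) -> (B, C, 1, 1)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # per-channel gates in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)                            # scale each feature map

class SpatialAttention(nn.Module):
    """Spatial attention: a single-channel gate over image locations
    (CBAM style), computed from channel-pooled summaries."""
    def __init__(self, kernel_size=7):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        # Summarize the channel dimension at each location by avg- and max-pooling.
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        gate = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * gate                                    # scale each spatial position

# Usage: insert both attention stages after a convolutional block.
block = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1),
    nn.ReLU(inplace=True),
    FeatureAttention(64),
    SpatialAttention(),
)
x = torch.randn(1, 3, 128, 128)   # toy input roughly the size of an SVRT stimulus
print(block(x).shape)             # torch.Size([1, 64, 128, 128])
```

Because both gates multiply the feature tensor elementwise, they can be dropped into an existing CNN (e.g., a ResNet block) without changing its output shape, which is one reason this style of attention is a common baseline extension.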
