Passive attention in artificial neural networks predicts human visual selectivity

Developments in machine learning interpretability techniques over the past decade have provided new tools for observing which image regions are most informative for classification and localization in artificial neural networks (ANNs). Are the same regions similarly informative to human observers? Using data from 79 new experiments and 7,810 participants, we show that passive attention techniques overlap significantly with human visual selectivity estimates derived from six distinct behavioral tasks: visual discrimination, spatial localization, recognizability, free-viewing, cued-object search, and saliency search fixations. We find that input visualizations derived from relatively simple ANN architectures, probed using guided backpropagation, are the best predictors of a shared component in the joint variability of the human measures. We validate these correlational results with causal manipulations in recognition experiments: in a speeded recognition task, images masked with ANN attention maps were easier for humans to classify than images masked with control maps, and recognition performance in the same ANN models was likewise affected when input images were masked with human visual selectivity maps. This work contributes a new approach to evaluating the biological and psychological validity of leading ANNs as models of human vision: examining their similarities to and differences from human observers in terms of visual selectivity over the information contained in images.
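
To make the pipeline the abstract describes concrete, below is a minimal sketch, not the authors' released code: it derives a guided-backpropagation attention map from a pretrained CNN, correlates it with a human visual-selectivity map, and runs the model-side masking manipulation. The file names `example.jpg` and `human_map.npy` are hypothetical placeholders, and the choice of VGG16 and Spearman correlation are illustrative assumptions.

```python
# Minimal sketch, assuming a torchvision VGG16 and a precomputed 224x224
# human selectivity map; illustrative only, not the authors' pipeline.
import numpy as np
import torch
import torch.nn as nn
from PIL import Image
from scipy.stats import spearmanr
from torchvision import models, transforms

model = models.vgg16(weights="IMAGENET1K_V1").eval()

# Guided backpropagation: at each ReLU, pass back only positive gradients.
# The standard ReLU backward already zeroes gradients where the forward
# input was negative; clamping grad_in additionally drops negative
# upstream gradients.
def clamp_negative_grads(module, grad_in, grad_out):
    return (torch.clamp(grad_in[0], min=0.0),)

for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.inplace = False  # full backward hooks need out-of-place ReLUs
        m.register_full_backward_hook(clamp_negative_grads)

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
img.requires_grad_(True)

logits = model(img)
logits[0, logits.argmax()].backward()  # gradient of the top-class score

# Collapse the input gradient to a single 2-D attention map.
ann_map = img.grad.abs().max(dim=1)[0].squeeze().numpy()

# Correlational analysis: compare the ANN map with a human selectivity map.
human_map = np.load("human_map.npy")  # hypothetical 224x224 array
rho, _ = spearmanr(ann_map.ravel(), human_map.ravel())
print(f"ANN-human Spearman correlation: {rho:.3f}")

# Causal manipulation: mask the image with the normalized attention map and
# test whether the model still recognizes it; the paper runs the human
# analogue of this test as a speeded recognition experiment.
mask = torch.from_numpy(ann_map / ann_map.max()).float()
masked_pred = model(img.detach() * mask).argmax().item()
print(f"Top-1 class after masking: {masked_pred}")
```

The same comparison can be repeated with other map-extraction methods (e.g., Grad-CAM) and other architectures to reproduce the kind of model comparison the abstract reports.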
