Global-and-local attention networks for visual recognition

State-of-the-art deep convolutional networks (DCNs) such as squeeze-and-excitation (SE) residual networks implement a form of attention, also known as contextual guidance, which is derived from global image features. Here, we explore a complementary form of attention, known as visual saliency, which is derived from local image features. We extend the SE module with a novel global-and-local attention (GALA) module that combines both forms of attention, resulting in state-of-the-art accuracy on ILSVRC. We further describe ClickMe.ai, a large-scale online experiment in which human participants identify diagnostic image regions that are used to co-train a GALA network. Adding humans in the loop is shown to significantly improve network accuracy, while also yielding visual features that are more interpretable and more similar to those used by human observers.
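The GALA module pairs SE-style channel (global) attention with a saliency-style spatial (local) attention map computed from the same feature tensor. Below is a minimal PyTorch sketch of one possible such block; the class name GALABlock, the 1x1-convolution local pathway, the reduction ratio, the simple multiplicative combination of the two attention signals, and the sigmoid gating are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class GALABlock(nn.Module):
    """Sketch of a global-and-local attention (GALA) style block.

    Combines an SE-style global (channel-wise) attention vector with a
    saliency-style local (spatial) attention map. The integration used here
    (elementwise product followed by sigmoid gating) is an assumption; the
    paper's exact combination may differ.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Global pathway: squeeze-and-excitation style channel attention.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.global_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Local pathway: 1x1 convolutions producing a per-location saliency map.
        self.local_conv = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: (B, C), broadcast back to (B, C, 1, 1).
        g = self.global_fc(self.global_pool(x).view(b, c)).view(b, c, 1, 1)
        # Spatial attention: (B, 1, H, W).
        l = self.local_conv(x)
        # Combine the two signals and gate the input feature map.
        attention = torch.sigmoid(g * l)
        return x * attention


if __name__ == "__main__":
    block = GALABlock(channels=64)
    features = torch.randn(2, 64, 32, 32)
    print(block(features).shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch the block is a drop-in residual-style attention gate: it preserves the input shape, so it can be inserted after any convolutional stage of a backbone such as a residual network.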
