Global-and-local attention networks for visual recognition

State-of-the-art deep convolutional networks (DCNs) such as squeeze-and-excitation (SE) residual networks implement a form of attention, also known as contextual guidance, which is derived from global image features. Here, we explore a complementary form of attention, known as visual saliency, which is derived from local image features. We extend the SE module with a novel global-and-local attention (GALA) module that combines both forms of attention, resulting in state-of-the-art accuracy on ILSVRC. We further describe ClickMe.ai, a large-scale online experiment in which human participants identify diagnostic image regions that are used to co-train a GALA network. Adding humans in the loop is shown to significantly improve network accuracy, while also yielding visual features that are more interpretable and more similar to those used by human observers.
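The GALA module pairs SE-style channel (global) attention with a saliency-style spatial (local) attention map computed from the same feature tensor. Below is a minimal PyTorch sketch of one possible such block; the class name GALABlock, the 1x1-convolution local pathway, the reduction ratio, the simple multiplicative combination of the two attention signals, and the sigmoid gating are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class GALABlock(nn.Module):
    """Sketch of a global-and-local attention (GALA) style block.

    Combines an SE-style global (channel-wise) attention vector with a
    saliency-style local (spatial) attention map. The integration used here
    (elementwise product followed by sigmoid gating) is an assumption; the
    paper's exact combination may differ.
    """

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Global pathway: squeeze-and-excitation style channel attention.
        self.global_pool = nn.AdaptiveAvgPool2d(1)
        self.global_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        # Local pathway: 1x1 convolutions producing a per-location saliency map.
        self.local_conv = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: (B, C), broadcast back to (B, C, 1, 1).
        g = self.global_fc(self.global_pool(x).view(b, c)).view(b, c, 1, 1)
        # Spatial attention: (B, 1, H, W).
        l = self.local_conv(x)
        # Combine the two signals and gate the input feature map.
        attention = torch.sigmoid(g * l)
        return x * attention


if __name__ == "__main__":
    block = GALABlock(channels=64)
    features = torch.randn(2, 64, 32, 32)
    print(block(features).shape)  # torch.Size([2, 64, 32, 32])
```

In this sketch the block is a drop-in residual-style attention gate: it preserves the input shape, so it can be inserted after any convolutional stage of a backbone such as a residual network.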
