Saliency-based Sequential Image Attention with Multiset Prediction

Humans process visual scenes selectively and sequentially using attention. Central to models of human visual attention is the saliency map. We propose a hierarchical visual architecture that operates on a saliency map and uses a novel attention mechanism to sequentially focus on salient regions and take additional glimpses within those regions. Motivated by human visual attention, the architecture is evaluated on multi-label image classification using a novel multiset task, where it achieves high precision and recall while localizing objects with its attention. Unlike conventional multi-label image classification models, the model supports multiset prediction, because its reinforcement-learning-based training process allows for arbitrary label permutation and multiple instances per label.
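The multiset property described above can be illustrated with a small sketch. It is not the paper's training procedure: the function names and the +1/-1 reward values are assumptions chosen for illustration. It shows one way a per-prediction reward can be made independent of label order while still counting repeated instances of the same label, along with precision and recall computed over multisets.

```python
from collections import Counter


def multiset_rewards(predicted, target):
    """Give each prediction a reward, ignoring the order of the target labels.

    A prediction earns +1.0 if an unclaimed instance of that label remains in
    the target multiset, and -1.0 otherwise. The reward values are illustrative
    assumptions, not taken from the paper.
    """
    remaining = Counter(target)  # label -> number of unclaimed instances
    rewards = []
    for label in predicted:
        if remaining[label] > 0:
            remaining[label] -= 1  # claim one instance of this label
            rewards.append(1.0)
        else:
            rewards.append(-1.0)  # wrong label or too many copies
    return rewards


def multiset_precision_recall(predicted, target):
    """Precision and recall where repeated labels count separately."""
    pred_counts, target_counts = Counter(predicted), Counter(target)
    # Counter intersection keeps the minimum count per label.
    overlap = sum((pred_counts & target_counts).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(target) if target else 0.0
    return precision, recall


# Target contains the label 3 twice and the label 5 once.
target = [3, 5, 3]
print(multiset_rewards([5, 3, 3], target))             # any ordering is rewarded
print(multiset_rewards([3, 3, 5, 3], target))          # the third 3 is penalized
print(multiset_precision_recall([3, 3, 5, 3], target))
```

Because the target is held as a multiset of counts instead of an ordered list or a set, predicting the labels in a different order earns the same total reward, and a label that appears twice in the image must be predicted twice to reach full recall.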
