Tidying Deep Saliency Prediction Architectures

Learning computational models of visual attention (saliency estimation) is an effort to bring machines and robots closer to human visual cognitive abilities. Data-driven approaches have dominated the field since the introduction of deep neural network architectures. In deep learning research, architecture design choices are often empirical and frequently lead to models that are more complex than necessary. This complexity, in turn, makes it harder to meet application requirements. In this paper, we identify four key components of saliency models: input features, multi-level integration, readout architecture, and loss functions. We review existing state-of-the-art models along these four components and propose novel, simpler alternatives. As a result, we propose two novel end-to-end architectures, SimpleNet and MDNSal, which are neater, minimal, and more interpretable, and which achieve state-of-the-art performance on public saliency benchmarks. SimpleNet is an optimized encoder-decoder architecture that brings notable performance gains on the SALICON dataset (the largest saliency benchmark). MDNSal is a parametric model that directly predicts the parameters of a Gaussian mixture model (GMM) and aims to make the prediction maps more interpretable. The proposed saliency models run inference at 25 fps, making them suitable for real-time applications. Code and pre-trained models are available at this https URL.
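
To illustrate what a parametric, GMM-based saliency prediction could look like, the sketch below renders a saliency map from a handful of mixture parameters. It is not the MDNSal implementation: the function name `gmm_saliency_map`, the axis-aligned (diagonal-covariance) Gaussians, the normalized [0, 1] coordinate convention, and the example parameter values are all assumptions made for illustration.

```python
import math

def gmm_saliency_map(weights, means, sigmas, height, width):
    """Render a 2D mixture of axis-aligned Gaussians on a pixel grid.

    Hypothetical helper, not taken from the paper.
    weights: mixture weights (non-negative, summing to 1)
    means:   (x, y) centres in normalized [0, 1] image coordinates
    sigmas:  (sx, sy) standard deviations in the same normalized units
    """
    grid = [[0.0] * width for _ in range(height)]
    for w, (mx, my), (sx, sy) in zip(weights, means, sigmas):
        # Weighted normalizing constant of a 2D diagonal Gaussian.
        norm = w / (2.0 * math.pi * sx * sy)
        for r in range(height):
            y = (r + 0.5) / height  # pixel-centre y coordinate
            for c in range(width):
                x = (c + 0.5) / width  # pixel-centre x coordinate
                z = ((x - mx) / sx) ** 2 + ((y - my) / sy) ** 2
                grid[r][c] += norm * math.exp(-0.5 * z)
    # Rescale so the map sums to 1 and can be read as a fixation distribution.
    total = sum(sum(row) for row in grid)
    return [[v / total for v in row] for row in grid]

# Two assumed components: a broad central blob and a tighter lower-left one.
saliency = gmm_saliency_map(
    weights=[0.7, 0.3],
    means=[(0.5, 0.5), (0.2, 0.8)],
    sigmas=[(0.1, 0.1), (0.05, 0.05)],
    height=24,
    width=32,
)
```

Under this reading, each component can be inspected directly (where it sits, how wide it is, how much weight it carries), which is one way a mixture-based output may be easier to interpret than a dense per-pixel map.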
