TranSalNet: Visual saliency prediction using transformers

Convolutional neural networks (CNNs) have significantly advanced computational modeling for saliency prediction. However, the inductive biases inherent in convolutional architectures limit their capacity to encode long-range context, which can make a saliency model less humanlike. Transformers have shown great potential for encoding long-range information through the self-attention mechanism. In this paper, we propose a novel saliency model that integrates transformer components into CNNs to capture long-range contextual information. Experimental results show that the new components improve performance and that the proposed model achieves promising results in predicting saliency.
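
To illustrate how self-attention can relate distant positions of a CNN feature map, the following is a minimal sketch in plain Python. It is not the architecture proposed in the paper: the function names (`flatten_feature_map`, `self_attention`), the toy 2x2 feature map, and the use of identity query/key/value projections in a single head are all assumptions made for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_feature_map(fmap):
    """Turn an H x W x C feature map (nested lists) into H*W tokens of length C."""
    return [pixel for row in fmap for pixel in row]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections); a trained model would learn these projections.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for q in tokens:
        # Every token is scored against every other token, whatever their
        # spatial distance in the original feature map.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][c] for j in range(len(tokens))) for c in range(d)]
        )
    return outputs, weights


# Hypothetical 2x2 feature map with 2 channels: the left column shares one
# feature vector, the right column another.
fmap = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
tokens = flatten_feature_map(fmap)
out, weights = self_attention(tokens)
```

In this toy example the top-left token (index 0) attends more strongly to the bottom-left token (index 2), which has the same features, than to its horizontal neighbour (index 1). Attention weights depend on feature similarity rather than on spatial adjacency, which is the property the abstract refers to as long-range contextual encoding.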
