Paper information - TranSalNet: Visual saliency prediction using transformers

TranSalNet: Visual saliency prediction using transformers

Convolutional neural networks (CNNs) have significantly advanced computational modeling for saliency prediction. However, the inductive biases inherent in convolutional architectures limit their capacity to encode long-range context, which can make a saliency model less humanlike. Transformers have shown great potential for encoding long-range information through the self-attention mechanism. In this paper, we propose a novel saliency model that integrates transformer components into CNNs to capture long-range contextual information. Experimental results show that the transformer components improve performance and that the proposed model achieves promising results in predicting saliency.
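To make the idea of combining self-attention with CNN features concrete, the following is a minimal sketch, not the architecture from the paper. It assumes a single attention head, random projection weights, and a random array standing in for a CNN feature map. The function name `self_attention_over_feature_map` and all shapes are hypothetical and chosen only for illustration. It shows how a spatial feature map can be flattened into a sequence of tokens so that every location can attend to every other location, however far apart they are.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention_over_feature_map(features, w_q, w_k, w_v):
    """Apply single-head self-attention to a CNN-style feature map.

    features: array of shape (C, H, W), standing in for a CNN stage output.
    w_q, w_k, w_v: projection matrices of shape (C, D) (illustrative only).
    Returns the attended map of shape (D, H, W) and the attention
    weights of shape (H*W, H*W).
    """
    c, h, w = features.shape
    # Flatten the spatial grid into a sequence of H*W tokens of size C.
    tokens = features.reshape(c, h * w).T            # (H*W, C)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Each token scores every other token, regardless of spatial distance.
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (H*W, H*W)
    weights = softmax(scores, axis=-1)
    attended = weights @ v                           # (H*W, D)
    # Restore the spatial layout so later convolutional layers can use it.
    return attended.T.reshape(-1, h, w), weights


# Hypothetical sizes; a real model would use learned weights and CNN outputs.
rng = np.random.default_rng(0)
C, H, W, D = 8, 4, 6, 8
features = rng.standard_normal((C, H, W))
w_q, w_k, w_v = (rng.standard_normal((C, D)) * 0.1 for _ in range(3))
out, attn = self_attention_over_feature_map(features, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

Each row of `attn` is a probability distribution over all `H*W` locations, which is what lets a single layer mix information across the whole image. A convolution with a small kernel, by contrast, only mixes information within a local neighborhood per layer.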
