Spatio-Temporal Self-Attention Network for Video Saliency Prediction

3D convolutional neural networks have achieved promising results on video tasks in computer vision, including video saliency prediction, the task explored in this paper. However, 3D convolution encodes visual representations only within a fixed local spatio-temporal window determined by its kernel size, whereas human attention is always attracted by relational visual features at different times in a video. To overcome this limitation, we propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction, in which multiple Spatio-Temporal Self-Attention (STSA) modules are employed at different levels of a 3D convolutional backbone to directly capture long-range relations between spatio-temporal features of different time steps. In addition, we propose an Attentional Multi-Scale Fusion (AMSF) module to integrate multi-level features with the perception of context in semantic and spatio-temporal subspaces. Extensive experiments demonstrate the contributions of the key components of our method, and the results on the DHF1K, Hollywood-2, UCF, and DIEM benchmark datasets demonstrate the superiority of the proposed model over state-of-the-art models.
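
To make the contrast with a fixed-size 3D kernel concrete, the following is a minimal sketch of generic scaled dot-product self-attention applied jointly over the time and space dimensions of a feature volume. It is not the STSA module itself: the function name `spatiotemporal_self_attention`, the tensor layout `(T, H, W, C)`, the random projection matrices, and the residual connection are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatiotemporal_self_attention(feats, w_q, w_k, w_v):
    """Generic self-attention over all space-time positions (illustrative only).

    feats: array of shape (T, H, W, C), a hypothetical 3D-CNN feature volume.
    w_q, w_k, w_v: (C, C) projection matrices (assumed, randomly initialised here).
    """
    t, h, w, c = feats.shape
    # Flatten time and space so every position becomes one token.
    tokens = feats.reshape(t * h * w, c)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Pairwise similarity between every pair of space-time positions.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    # Residual connection keeps the original features alongside the attended ones.
    out = tokens + attn @ v
    return out.reshape(t, h, w, c), attn


rng = np.random.default_rng(0)
T, H, W, C = 4, 3, 3, 8
feats = rng.standard_normal((T, H, W, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
out, attn = spatiotemporal_self_attention(feats, w_q, w_k, w_v)
```

Each row of `attn` holds the weights that one space-time position assigns to every other position, including those in other frames, so a position in the first frame can draw directly on features from the last frame. A 3D convolution would need several stacked layers to cover the same temporal span.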
