论文信息 - CASNet: A Cross-Attention Siamese Network for Video Salient Object Detection

CASNet: A Cross-Attention Siamese Network for Video Salient Object Detection

Recent works on video salient object detection have demonstrated that directly transferring the generalization ability of image-based models to video data without modeling spatial-temporal information remains nontrivial and challenging. Considering both intraframe accuracy and interframe consistency of saliency detection, this article presents a novel cross-attention based encoder-decoder model under the Siamese framework (CASNet) for video salient object detection. A baseline encoder-decoder model trained with Lovász softmax loss function is adopted as a backbone network to guarantee the accuracy of intraframe salient object detection. Self- and cross-attention modules are incorporated into our model in order to preserve the saliency correlation and improve intraframe salient detection consistency. Extensive experimental results obtained by ablation analysis and cross-data set validation demonstrate the effectiveness of our proposed method. Quantitative results indicate that our CASNet model outperforms 19 state-of-the-art image- and video-based methods on six benchmark data sets.

[1] Thomas Brox,et al. FlowNet: Learning Optical Flow with Convolutional Networks , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[2] Kaiming He,et al. Feature Pyramid Networks for Object Detection , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[3] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[4] Andrew Zisserman,et al. Two-Stream Convolutional Networks for Action Recognition in Videos , 2014, NIPS.

[5] Han Zhang,et al. Self-Attention Generative Adversarial Networks , 2018, ICML.

[6] Huchuan Lu,et al. Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[7] Linwei Ye,et al. Saliency Detection for Unconstrained Videos Using Superpixel-Level Graph and Spatiotemporal Propagation , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[8] Yuning Jiang,et al. Unified Perceptual Parsing for Scene Understanding , 2018, ECCV.

[9] Huchuan Lu,et al. A Stagewise Refinement Model for Detecting Salient Objects in Images , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[10] Qingming Huang,et al. Stacked Cross Refinement Network for Edge-Aware Salient Object Detection , 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[11] Gayoung Lee,et al. Deep Saliency with Encoded Low Level Distance Map and High Level Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12] Xiaogang Wang,et al. Pyramid Scene Parsing Network , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13] Ling Shao,et al. Video Salient Object Detection via Fully Convolutional Networks , 2017, IEEE Transactions on Image Processing.

[14] Ling Shao,et al. Consistent Video Saliency Using Local Gradient Flow Optimization and Global Refinement , 2015, IEEE Transactions on Image Processing.

[15] Xiaogang Wang,et al. Residual Attention Network for Image Classification , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16] Trevor Darrell,et al. Long-term recurrent convolutional networks for visual recognition and description , 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17] Thomas Brox,et al. U-Net: Convolutional Networks for Biomedical Image Segmentation , 2015, MICCAI.

[18] Ali Borji,et al. Revisiting Video Saliency: A Large-Scale Benchmark and a New Model , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[19] Abhinav Gupta,et al. Non-local Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[20] Bo Ren,et al. Enhanced-alignment Measure for Binary Foreground Map Evaluation , 2018, IJCAI.

[21] Gang Wang,et al. Progressive Attention Guided Recurrent Network for Salient Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[22] Huchuan Lu,et al. Learning Uncertain Convolutional Features for Accurate Saliency Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[23] Jingdong Wang,et al. Salient Object Detection: A Discriminative Regional Feature Integration Approach , 2013, International Journal of Computer Vision.

[24] Hong Qin,et al. Video Saliency Detection via Spatial-Temporal Fusion and Low-Rank Coherency Diffusion , 2017, IEEE Transactions on Image Processing.

[25] James M. Rehg,et al. Video Segmentation by Tracking Many Figure-Ground Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[26] Ali Borji,et al. Salient object detection: A survey , 2014, Computational Visual Media.

[27] Jingdong Wang,et al. OCNet: Object Context Network for Scene Parsing , 2018, ArXiv.

[28] Luc Van Gool,et al. A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29] Jitendra Malik,et al. Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[30] Yizhou Yu,et al. Visual Saliency Detection Based on Multiscale Deep CNN Features , 2016, IEEE Transactions on Image Processing.

[31] Mubarak Shah,et al. Video Object Segmentation through Spatially Accurate and Temporally Dense Extraction of Primary Object Regions , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[32] Dit-Yan Yeung,et al. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting , 2015, NIPS.

[33] Xiaowu Chen,et al. A Benchmark Dataset and Saliency-Guided Stacked Autoencoders for Video-Based Salient Object Detection , 2016, IEEE Transactions on Image Processing.

[34] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[35] Tianming Liu,et al. Predicting eye fixations using convolutional neural networks , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36] Changqun Xia,et al. Selectivity or Invariance: Boundary-Aware Salient Object Detection , 2018, 2019 IEEE/CVF International Conference on Computer Vision (ICCV).

[37] Wenguan Wang,et al. Shifting More Attention to Video Salient Object Detection , 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[38] Q. M. Jonathan Wu,et al. Salient object detection via multi-scale attention CNN , 2018, Neurocomputing.

[39] Iasonas Kokkinos,et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[40] Iasonas Kokkinos,et al. Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[41] Yuan Xie,et al. Flow Guided Recurrent Neural Encoder for Video Salient Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[42] Kun Yu,et al. DenseASPP for Semantic Segmentation in Street Scenes , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[43] Jakob Uszkoreit,et al. A Decomposable Attention Model for Natural Language Inference , 2016, EMNLP.

[44] Enhua Wu,et al. Squeeze-and-Excitation Networks , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[45] Yizhou Yu,et al. Deep Contrast Learning for Salient Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[46] Zulin Wang,et al. Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM , 2017, ECCV.

[47] Mirella Lapata,et al. Long Short-Term Memory-Networks for Machine Reading , 2016, EMNLP.

[48] Tat-Seng Chua,et al. SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49] Jian Sun,et al. Saliency Optimization from Robust Background Detection , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[50] Zhuowen Tu,et al. Deeply Supervised Salient Object Detection with Short Connections , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[51] Rita Cucchiara,et al. Predicting Human Eye Fixations via an LSTM-Based Saliency Attentive Model , 2016, IEEE Transactions on Image Processing.

[52] Matthew B. Blaschko,et al. The Lovasz-Softmax Loss: A Tractable Surrogate for the Optimization of the Intersection-Over-Union Measure in Neural Networks , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[53] Ali Borji,et al. Salient Object Detection Driven by Fixation Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[54] Richard Socher,et al. A Deep Reinforced Model for Abstractive Summarization , 2017, ICLR.

[55] Alex Graves,et al. Recurrent Models of Visual Attention , 2014, NIPS.

[56] Junwei Han,et al. DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57] Nanning Zheng,et al. Learning to Detect a Salient Object , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[58] Jun Fu,et al. Dual Attention Network for Scene Segmentation , 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[59] Sanyuan Zhao,et al. Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection , 2018, ECCV.

[60] Jiangjiang Liu,et al. Salient Objects in Clutter: Bringing Salient Object Detection to the Foreground , 2018, ECCV.

[61] Huchuan Lu,et al. Saliency detection via Cellular Automata , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[62] Huchuan Lu,et al. Saliency Detection via Graph-Based Manifold Ranking , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[63] Ming-Hsuan Yang,et al. PiCANet: Learning Pixel-Wise Contextual Attention for Saliency Detection , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[64] Tao Li,et al. Structure-Measure: A New Way to Evaluate Foreground Maps , 2017, International Journal of Computer Vision.

[65] Huchuan Lu,et al. Learning to Detect Salient Objects with Image-Level Supervision , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[66] Trevor Darrell,et al. Fully Convolutional Networks for Semantic Segmentation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67] Radomír Mech,et al. Minimum Barrier Salient Object Detection at 80 FPS , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[68] Yichen Wei,et al. Deep Feature Flow for Video Recognition , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).