Video salient object detection using dual-stream spatiotemporal attention

Abstract

Video salient object detection supports many downstream video applications. However, existing deep learning-based methods still struggle when the salient object varies widely and the background changes greatly between and within frames. In this paper, we propose a dual-stream spatiotemporal attention network (DSSANet) for saliency detection in videos. It introduces a multiplex attention mechanism to extract and fuse the spatiotemporal features of the salient object across frames, thereby improving detection performance. DSSANet consists of three parts: (1) a context feature path that uses a novel attention-augmented convolutional LSTM to model the long-range dependency arising from the large temporal variation of the salient object across frames; (2) a content feature path that uses an attention-based 1D dilated convolution to model the local correlation structure between pixels of the salient object and the surrounding objects; and (3) a refinement fusion module that fuses the features from the two paths and refines the fused feature through attention-based feature selection. By integrating these three parts, DSSANet detects the salient object in a video accurately. Experiments on four public datasets demonstrate the effectiveness of DSSANet and its superiority over five state-of-the-art video salient object detection methods.
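To make two of the building blocks named above concrete, the following is a minimal sketch in plain Python of a 1D dilated convolution and of an attention-weighted fusion of two feature vectors. It is an illustration only, not the DSSANet implementation: the abstract does not specify how the attention scores are computed, so the function names (`dilated_conv1d`, `attention_fuse`), the kernel, and the fixed `scores` are assumptions standing in for learned quantities.

```python
import math


def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1D dilated convolution (cross-correlation form)."""
    # Distance between the first and last kernel tap.
    span = (len(kernel) - 1) * dilation
    return [
        sum(w * signal[i + k * dilation] for k, w in enumerate(kernel))
        for i in range(len(signal) - span)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(context_feat, content_feat, scores):
    """Fuse two equally sized feature vectors with softmax weights.

    `scores` is a hypothetical pair of attention logits; in a real
    network these would be predicted from the features themselves.
    """
    w_context, w_content = softmax(scores)
    return [
        w_context * a + w_content * b
        for a, b in zip(context_feat, content_feat)
    ]


# Toy "content" feature: a dilation of 2 makes the 3-tap kernel span 5 samples.
signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
content = dilated_conv1d(signal, kernel=[1.0, 0.0, -1.0], dilation=2)

# Toy "context" feature (stand-in for a recurrent path's output).
context = [0.5, 0.5, 0.5]

# Equal logits give equal weights to both paths.
fused = attention_fuse(context, content, scores=[0.0, 0.0])
```

Dilation widens the receptive field without adding kernel weights, which is why a 3-tap kernel here covers 5 input samples. The softmax weighting is one common way to let a network select between two feature sources; the paper's actual refinement fusion module may differ.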
