Video salient object detection via spatiotemporal attention neural networks

Abstract Recently, deep convolutional neural networks have been widely introduced into image salient object detection and achieve good performance in this community. However, as the complexity of video scenes, video salient object detection with deep learning models is still a challenge. The specific difficulties come from two aspects. First of all, the deep networks on image saliency detection cannot capture robust motion cues in video sequences. Secondly, as for the spatiotemporal fusing features, the existing methods simply exploit element-wise addition or concatenation, which not fully explores the contextual information and complementary correlation, thus they cannot produce more robust spatiotemporal features. To address these issues, we propose a two-stream based spatiotemporal attention neural network (STAN) for video salient object detection. We amply extract the motion information in terms of long short term memory (LSTM) network and 3D convolutional operation from optical flow-based prior and video sequences. Moreover, an attentive module is designed to integrate the different types of spatiotemporal feature maps by learning the corresponding weights. Meanwhile, in order to generate sufficient pixel-wise annotated video frames, we manually generate lots of coarse labels, which are well utilized to train a robust saliency prediction network. Experiments on the widely used challenging datasets (e.g., FBMS and DAVIS) prove that the proposed STAN has competitive performances among salient object detection methods.

[1]  Feng Zhou,et al.  Time-Mapping Using Space-Time Saliency , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[2]  Yunchao Wei,et al.  STC: A Simple to Complex Framework for Weakly-Supervised Semantic Segmentation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[3]  Ling Shao,et al.  Consistent Video Saliency Using Local Gradient Flow Optimization and Global Refinement , 2015, IEEE Transactions on Image Processing.

[4]  Lihi Zelnik-Manor,et al.  Context-Aware Saliency Detection , 2012, IEEE Trans. Pattern Anal. Mach. Intell..

[5]  Xiaogang Wang,et al.  Saliency detection by multi-context deep learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[6]  Ling Shao,et al.  Video Saliency Detection Using Object Proposals , 2018, IEEE Transactions on Cybernetics.

[7]  Huchuan Lu,et al.  Learning to Detect Salient Objects with Image-Level Supervision , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[8]  Yael Pritch,et al.  Saliency filters: Contrast based filtering for salient region detection , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[9]  Jitendra Malik,et al.  Object Segmentation by Long Term Analysis of Point Trajectories , 2010, ECCV.

[10]  Tat-Seng Chua,et al.  SCA-CNN: Spatial and Channel-Wise Attention in Convolutional Networks for Image Captioning , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[11]  Wei Wu,et al.  End-to-End Flow Correlation Tracking with Spatial-Temporal Attention , 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[12]  Kristen Grauman,et al.  FusionSeg: Learning to Combine Motion and Appearance for Fully Automatic Segmentation of Generic Objects in Videos , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[13]  Ali Douik,et al.  A spatiotemporal model for video saliency detection , 2016, 2016 International Image Processing, Applications and Systems (IPAS).

[14]  Luc Van Gool,et al.  A Benchmark Dataset and Evaluation Methodology for Video Object Segmentation , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Yunchao Wei,et al.  STA: Spatial-Temporal Attention for Large-Scale Video-based Person Re-Identification , 2018, AAAI.

[16]  Yizhou Yu,et al.  Deep Contrast Learning for Salient Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[17]  Haibin Ling,et al.  A Deep Network Solution for Attention and Aesthetics Aware Photo Cropping , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[18]  Xiaowu Chen,et al.  A Benchmark Dataset and Saliency-Guided Stacked Autoencoders for Video-Based Salient Object Detection , 2016, IEEE Transactions on Image Processing.

[19]  Luc Van Gool,et al.  One-Shot Video Object Segmentation , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[20]  Huchuan Lu,et al.  Salient object detection via bootstrap learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Huchuan Lu,et al.  Learning Uncertain Convolutional Features for Accurate Saliency Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[22]  Xiaochun Cao,et al.  Cluster-Based Co-Saliency Detection , 2013, IEEE Transactions on Image Processing.

[23]  Ali Borji,et al.  Revisiting Video Saliency: A Large-Scale Benchmark and a New Model , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[24]  Daniel P. Huttenlocher,et al.  Efficient Graph-Based Image Segmentation , 2004, International Journal of Computer Vision.

[25]  Yu Cheng,et al.  Jointly Attentive Spatial-Temporal Pooling Networks for Video-Based Person Re-identification , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[26]  Tao Mei,et al.  Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  Jiasen Lu,et al.  Hierarchical Question-Image Co-Attention for Visual Question Answering , 2016, NIPS.

[28]  Junwei Han,et al.  DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Han Wang,et al.  Salient Object Detection With Spatiotemporal Background Priors for Video , 2017, IEEE Transactions on Image Processing.

[30]  Jing Liu,et al.  Robust Structured Subspace Learning for Data Representation , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[31]  Xia Li,et al.  Weakly Supervised Salient Object Detection With Spatiotemporal Cascade Neural Networks , 2019, IEEE Transactions on Circuits and Systems for Video Technology.

[32]  Zheng Wang,et al.  A deep-learning based feature hybrid framework for spatiotemporal saliency detection inside videos , 2018, Neurocomputing.

[33]  Ling Shao,et al.  Higher Order Energies for Image Segmentation , 2017, IEEE Transactions on Image Processing.

[34]  Huchuan Lu,et al.  Saliency detection via Cellular Automata , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[35]  Ali Borji,et al.  Salient Object Detection Driven by Fixation Prediction , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[36]  Huchuan Lu,et al.  Saliency Detection with Recurrent Fully Convolutional Networks , 2016, ECCV.

[37]  Zhuowen Tu,et al.  Deeply Supervised Salient Object Detection with Short Connections , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[38]  Christopher Joseph Pal,et al.  Describing Videos by Exploiting Temporal Structure , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[39]  Wenjun Zeng,et al.  An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data , 2016, AAAI.

[40]  Xia Li,et al.  Multi-Scale Spatiotemporal Conv-LSTM Network for Video Saliency Detection , 2018, ICMR.

[41]  Huchuan Lu,et al.  Deep networks for saliency detection via local estimation and global search , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42]  Kwan-Liu Ma,et al.  Stereoscopic Thumbnail Creation via Efficient Stereo Saliency Detection , 2017, IEEE Transactions on Visualization and Computer Graphics.

[43]  Ming-Hsuan Yang,et al.  SegFlow: Joint Learning for Video Object Segmentation and Optical Flow , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[44]  Michael J. Black,et al.  A Quantitative Analysis of Current Practices in Optical Flow Estimation and the Principles Behind Them , 2013, International Journal of Computer Vision.

[45]  Ling Shao,et al.  Hyperparameter Optimization for Tracking with Continuous Deep Q-Learning , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[46]  Yuan Xie,et al.  Flow Guided Recurrent Neural Encoder for Video Salient Object Detection , 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[47]  Sanyuan Zhao,et al.  Pyramid Dilated Deeper ConvLSTM for Video Salient Object Detection , 2018, ECCV.

[48]  Iasonas Kokkinos,et al.  DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs , 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[49]  Trung-Nghia Le,et al.  Video Salient Object Detection Using Spatiotemporal Deep Features , 2017, IEEE Transactions on Image Processing.

[50]  James M. Rehg,et al.  Video Segmentation by Tracking Many Figure-Ground Segments , 2013, 2013 IEEE International Conference on Computer Vision.

[51]  Xiaogang Jin,et al.  Quadruplet Network With One-Shot Learning for Fast Visual Object Tracking , 2017, IEEE Transactions on Image Processing.

[52]  Nanning Zheng,et al.  Automatic salient object segmentation based on context and shape prior , 2011, BMVC.

[53]  Hong Qin,et al.  Video Saliency Detection via Spatial-Temporal Fusion and Low-Rank Coherency Diffusion , 2017, IEEE Transactions on Image Processing.

[54]  Louis-Philippe Morency,et al.  Temporal Attention-Gated Model for Robust Sequence Classification , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[55]  Esa Rahtu,et al.  Segmenting Salient Objects from Images and Videos , 2010, ECCV.

[56]  Tao Mei,et al.  Multi-level Attention Networks for Visual Question Answering , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[57]  Ling Shao,et al.  Submodular Trajectories for Better Motion Segmentation in Videos , 2018, IEEE Transactions on Image Processing.

[58]  Rita Cucchiara,et al.  Predicting Human Eye Fixations via an LSTM-Based Saliency Attentive Model , 2016, IEEE Transactions on Image Processing.

[59]  Fatih Murat Porikli,et al.  Saliency-aware geodesic video object segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[60]  Yizhou Yu,et al.  Visual Saliency Detection Based on Multiscale Deep CNN Features , 2016, IEEE Transactions on Image Processing.

[61]  Ling Shao,et al.  Video Salient Object Detection via Fully Convolutional Networks , 2017, IEEE Transactions on Image Processing.

[62]  Ling Shao,et al.  Video Co-Saliency Guided Co-Segmentation , 2018, IEEE Transactions on Circuits and Systems for Video Technology.

[63]  Thomas Brox,et al.  Motion Trajectory Segmentation via Minimum Cost Multicuts , 2015, 2015 IEEE International Conference on Computer Vision (ICCV).

[64]  Vittorio Ferrari,et al.  Fast Object Segmentation in Unconstrained Video , 2013, 2013 IEEE International Conference on Computer Vision.

[65]  Karteek Alahari,et al.  Learning Motion Patterns in Videos , 2016, CVPR.

[66]  Ruigang Yang,et al.  Semi-Supervised Video Object Segmentation with Super-Trajectories , 2019, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[67]  Thomas Brox,et al.  FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[68]  Yuan Xie,et al.  Instance-Level Salient Object Segmentation , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[69]  Bastian Leibe,et al.  Online Adaptation of Convolutional Neural Networks for Video Object Segmentation , 2017, BMVC.

[70]  Xia Li,et al.  Kernalised Multi-resolution Convnet for Visual Tracking , 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW).

[71]  Nanning Zheng,et al.  Salient Object Detection: A Discriminative Regional Feature Integration Approach , 2013, International Journal of Computer Vision.

[72]  Zhi Liu,et al.  Segmentation Driven Low-rank Matrix Recovery for Saliency Detection , 2013, BMVC.

[73]  Huchuan Lu,et al.  Amulet: Aggregating Multi-level Convolutional Features for Salient Object Detection , 2017, 2017 IEEE International Conference on Computer Vision (ICCV).

[74]  Yueting Zhuang,et al.  DeepSaliency: Multi-Task Deep Neural Network Model for Salient Object Detection , 2015, IEEE Transactions on Image Processing.

[75]  Xiang Zhang,et al.  Superpixel-Based Spatiotemporal Saliency Detection , 2014, IEEE Transactions on Circuits and Systems for Video Technology.

[76]  Yuxin Peng,et al.  Two-Stream Collaborative Learning With Spatial-Temporal Attention for Video Classification , 2017, IEEE Transactions on Circuits and Systems for Video Technology.