Learning to Explore Saliency for Stereoscopic Videos Via Component-Based Interaction

In this paper, we devise a saliency prediction model for stereoscopic videos that learns to explore saliency through component-based interactions among spatial, temporal, and depth cues. The model first exploits the structure of a 3D residual network (3D-ResNet) to model the saliency driven by spatio-temporal coherence across consecutive frames. Subsequently, the saliency inferred from implicit depth is derived automatically from the displacement correlation between the left and right views using a deep convolutional network (ConvNet). Finally, a component-wise refinement network produces the final saliency maps over time by aggregating the saliency distributions obtained from the individual components. To further facilitate research on stereoscopic video saliency, we create a new dataset of 175 stereoscopic video sequences with diverse content, together with their dense eye fixation annotations. Extensive experiments show that the proposed model achieves superior performance compared with state-of-the-art methods on all publicly available eye fixation datasets.
