Video saliency prediction via spatio-temporal reasoning

Abstract: Video saliency detection often suffers from two problems: temporal motion patterns are hard to disentangle from spatial layout patterns, and temporal motion patterns are themselves hard to capture. This paper therefore proposes a novel deep network architecture for video saliency prediction. The network consists of three parts: a high-level representation module, an attention module, and a memory and reasoning module. The high-level representation module and the attention module capture spatial saliency, which is learned mainly from static images. The memory and reasoning module infers saliency from the spatial layout within frames and the temporal motion between frames. Because the first two modules concentrate on high-level representation of spatial patterns while the memory and reasoning module concentrates on spatial and temporal saliency reasoning, the temporal and spatial patterns can be disentangled efficiently. Quantitative and qualitative results show that the proposed method achieves promising results across a wide range of metrics.
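The three-part data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' network: every function below is a hypothetical, hand-written stand-in with no learned parameters (mean contrast in place of deep features, a softmax in place of the attention module, and a decayed running sum plus frame differencing in place of the memory and reasoning module). It only shows how per-frame spatial evidence and between-frame motion might be kept in separate stages and combined in a recurrent memory.

```python
import math
from typing import List, Optional

Frame = List[List[float]]  # one grayscale frame as rows of pixel values


def spatial_features(frame: Frame) -> Frame:
    """Stand-in for the high-level representation module: contrast to the frame mean."""
    flat = [v for row in frame for v in row]
    mean = sum(flat) / len(flat)
    return [[abs(v - mean) for v in row] for row in frame]


def spatial_attention(features: Frame) -> Frame:
    """Stand-in for the attention module: softmax weights over all positions."""
    peak = max(v for row in features for v in row)
    exps = [[math.exp(v - peak) for v in row] for row in features]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def frame_motion(prev: Optional[Frame], frame: Frame) -> Frame:
    """Absolute frame difference; all zeros for the first frame."""
    if prev is None:
        return [[0.0 for _ in row] for row in frame]
    return [[abs(a - b) for a, b in zip(r0, r1)] for r0, r1 in zip(prev, frame)]


def memory_step(memory: Frame, spatial: Frame, motion: Frame, decay: float = 0.5) -> Frame:
    """Stand-in for memory and reasoning: decayed memory plus new spatial and motion evidence."""
    return [
        [decay * m + (1.0 - decay) * (s + d) for m, s, d in zip(mr, sr, dr)]
        for mr, sr, dr in zip(memory, spatial, motion)
    ]


def normalize(saliency: Frame) -> Frame:
    """Scale a map to sum to 1; fall back to a uniform map if it is all zeros."""
    total = sum(sum(row) for row in saliency)
    count = sum(len(row) for row in saliency)
    if total == 0:
        return [[1.0 / count for _ in row] for row in saliency]
    return [[v / total for v in row] for row in saliency]


def predict_saliency(frames: List[Frame]) -> List[Frame]:
    """Run the toy pipeline over a clip and return one saliency map per frame."""
    memory = [[0.0 for _ in row] for row in frames[0]]
    prev = None
    outputs = []
    for frame in frames:
        feats = spatial_features(frame)
        attn = spatial_attention(feats)
        spatial = [[f * a for f, a in zip(fr, ar)] for fr, ar in zip(feats, attn)]
        memory = memory_step(memory, spatial, frame_motion(prev, frame))
        outputs.append(normalize(memory))
        prev = frame
    return outputs
```

A usage example, assuming a 3x3 clip in which a single bright pixel moves along the diagonal:

```python
clip = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
]
maps = predict_saliency(clip)  # one normalized map per input frame
```

In this sketch the spatial stages never see neighbouring frames and the motion cue enters only at the memory step, which mirrors the separation of concerns the abstract argues for. In the proposed network those stages are learned rather than hand-coded.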
