Predicting Video Saliency with Object-to-Motion CNN and Two-layer Convolutional LSTM

Over the past few years, deep neural networks (DNNs) have achieved great success in predicting the saliency of images. However, few works apply DNNs to predict the saliency of generic videos. In this paper, we propose a novel DNN-based video saliency prediction method. Specifically, we establish a large-scale eye-tracking database of videos (LEDOV), which provides sufficient data to train DNN models for predicting video saliency. Through statistical analysis of our LEDOV database, we find that human attention is normally attracted by objects, particularly moving objects or the moving parts of objects. Accordingly, we propose an object-to-motion convolutional neural network (OM-CNN) that learns spatio-temporal features for predicting intra-frame saliency by exploring information about both objectness and object motion. We further find from our database that human attention is temporally correlated, with smooth saliency transitions across video frames. Therefore, we develop a two-layer convolutional long short-term memory (2C-LSTM) network in our DNN-based method, which takes the extracted OM-CNN features as input. Consequently, inter-frame saliency maps of videos can be generated that account for the transition of attention across video frames. Finally, the experimental results show that our method advances the state-of-the-art in video saliency prediction.
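To make the 2C-LSTM idea concrete, the sketch below implements a minimal single-channel convolutional LSTM cell in numpy and stacks two of them, with the second layer consuming the first layer's hidden state. This is only an illustrative toy, not the paper's architecture: the kernel sizes, single-channel gates, random stand-in features (in place of real OM-CNN features), and the final sigmoid readout are all simplifying assumptions made here.

```python
import numpy as np

def conv2d_same(x, w):
    """Naive 2D cross-correlation with 'same' zero padding (single channel)."""
    kh, kw = w.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    H, W = x.shape
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * w)
    return out

class ConvLSTMCell:
    """One convolutional LSTM layer: the input, forget, output, and candidate
    gates are convolutions over the input frame and the hidden state."""
    def __init__(self, k=3, seed=0):
        rng = np.random.default_rng(seed)
        # one (input-kernel, hidden-kernel) pair per gate: i, f, o, g
        self.w = {gate: (rng.normal(scale=0.1, size=(k, k)),
                         rng.normal(scale=0.1, size=(k, k)))
                  for gate in "ifog"}

    def step(self, x, h, c):
        sig = lambda z: 1.0 / (1.0 + np.exp(-z))
        pre = {gate: conv2d_same(x, wx) + conv2d_same(h, wh)
               for gate, (wx, wh) in self.w.items()}
        i, f, o = sig(pre["i"]), sig(pre["f"]), sig(pre["o"])
        c_new = f * c + i * np.tanh(pre["g"])   # convolutional cell update
        h_new = o * np.tanh(c_new)
        return h_new, c_new

# Two stacked cells, analogous in spirit to a 2C-LSTM over per-frame features.
H, W, T = 8, 8, 5
layers = [ConvLSTMCell(seed=s) for s in (0, 1)]
h = [np.zeros((H, W)) for _ in layers]
c = [np.zeros((H, W)) for _ in layers]
features = np.random.default_rng(2).normal(size=(T, H, W))  # stand-in features
for t in range(T):
    x = features[t]
    for li, cell in enumerate(layers):
        h[li], c[li] = cell.step(x, h[li], c[li])
        x = h[li]                         # layer 2 reads layer 1's hidden state
saliency = 1.0 / (1.0 + np.exp(-h[-1]))   # toy per-frame saliency map in (0, 1)
```

The key property this preserves is that, unlike a fully connected LSTM, every gate is a convolution, so the hidden state remains a 2D map and spatial structure is carried across frames, which is what allows the recurrent layer to model smooth attention transitions over a video.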
