Hybrid convolutional neural networks and optical flow for video visual attention prediction

In this paper, a method based on convolutional neural networks (CNNs) and optical flow is proposed for predicting visual attention in videos. First, a deep-learning framework is employed to extract spatial features from frames, replacing the commonly used handcrafted features. Optical flow is then calculated to obtain temporal features of moving objects in the video frames, which tend to draw viewers' attention. By integrating these two groups of features, a hybrid spatial-temporal feature set is obtained and taken as the input of a support vector machine (SVM) to predict the degree of visual attention. Finally, two publicly available video datasets are used to test the performance of the proposed model, and the results demonstrate the efficacy of the proposed approach.
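The fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the CNN architecture, the optical flow algorithm, the fusion rule, or the SVM kernel. The sketch assumes that spatial features arrive as a plain vector (standing in for CNN activations), that the temporal feature is the mean and maximum flow magnitude of a region, that fusion is simple concatenation, and that a linear SVM trained by subgradient descent on the hinge loss is an acceptable stand-in. All function names and the toy data are hypothetical.

```python
import math


def flow_magnitude_feature(flow_vectors):
    """Summarize a region's optical flow as [mean magnitude, max magnitude].

    flow_vectors is a list of (u, v) displacements, one per pixel.
    This particular summary is an assumption, not taken from the paper.
    """
    mags = [math.hypot(u, v) for u, v in flow_vectors]
    return [sum(mags) / len(mags), max(mags)]


def fuse_features(spatial, flow_vectors):
    """Build the hybrid feature: spatial vector followed by temporal features."""
    return list(spatial) + flow_magnitude_feature(flow_vectors)


def decision_value(w, b, x):
    """Signed score of a linear SVM; larger means 'more likely attended'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Train a linear SVM by subgradient descent on the regularized hinge loss.

    labels must be +1 (attended) or -1 (not attended).
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * decision_value(w, b, x)
            # Weight decay from the L2 regularizer.
            w = [wi * (1.0 - lr * reg) for wi in w]
            if margin < 1.0:
                # Hinge loss is active: move the boundary toward this sample's side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


# Toy data (hypothetical): two attended regions with strong motion,
# two unattended regions with almost none.
regions = [
    ([0.9, 0.1], [(2.0, 1.0), (1.5, 0.5)], +1),
    ([0.8, 0.2], [(1.0, 1.0), (2.0, 0.0)], +1),
    ([0.1, 0.9], [(0.1, 0.0), (0.0, 0.1)], -1),
    ([0.2, 0.7], [(0.0, 0.0), (0.2, 0.1)], -1),
]
X = [fuse_features(spatial, flow) for spatial, flow, _ in regions]
y = [label for _, _, label in regions]
w, b = train_linear_svm(X, y)
scores = [decision_value(w, b, x) for x in X]
```

Here the signed SVM score is used as a rough proxy for the "degree of visual attention"; a real system would more likely compute the spatial vector with a pretrained CNN and the flow field with a dense optical flow routine, and might prefer a kernel SVM or SVM regression.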
