Camera-Assisted Video Saliency Prediction and Its Applications

Video saliency prediction is an indispensable yet challenging technique that facilitates applications such as video surveillance, autonomous driving, and realistic rendering. Motivated by the popularity of embedded cameras, this paper predicts region-level saliency in videos by leveraging human gaze locations recorded with a camera (e.g., one built into an iMac or a laptop PC). The proposed camera-assisted mechanism improves saliency prediction by discovering the regions that humans attend to inside a video clip. It is orthogonal to current saliency models, i.e., any existing video/image saliency model can be boosted by our mechanism. First, spatial- and temporal-level visual features are exploited collaboratively to calculate an initial saliency map. We notice that current saliency models are not sufficiently adaptable to variations in lighting, different view angles, and complicated backgrounds. Therefore, assisted by a camera that tracks human gaze movements, a non-negative matrix factorization algorithm is designed to accurately localize the semantically/visually salient video regions perceived by humans. Finally, the learned human gaze locations and the initial saliency map are integrated to optimize video saliency calculation. Experimental results demonstrate that 1) our approach achieves state-of-the-art video saliency prediction accuracy, considerably outperforming 11 mainstream algorithms, and 2) our method can conveniently and successfully enhance video retargeting, quality estimation, and summarization.
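The abstract does not give the factorization objective or the rule used to integrate the two maps, so the following is only a minimal sketch under stated assumptions. It uses the standard Lee-Seung multiplicative updates for plain non-negative matrix factorization (not the paper's own designed variant) and a simple convex combination as a stand-in for the integration step. The names `nmf`, `normalize`, `fuse`, and the weight `alpha` are hypothetical, and the toy arrays are synthetic.

```python
import numpy as np


def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF via Lee-Seung multiplicative updates.

    Approximates a non-negative matrix V (regions x features) as W @ H
    by reducing the Frobenius error ||V - W H||. This is the textbook
    algorithm, used here as a stand-in for the paper's own variant.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def normalize(x):
    """Rescale an array to [0, 1]; a constant array maps to zeros."""
    x = x - x.min()
    span = x.max()
    return x / span if span > 0 else np.zeros_like(x)


def fuse(initial_saliency, gaze_map, alpha=0.5):
    """Assumed fusion rule: convex combination of the two normalized maps."""
    mixed = alpha * normalize(initial_saliency) + (1 - alpha) * normalize(gaze_map)
    return normalize(mixed)


# Toy data: 12 video regions described by 8 non-negative features.
rng = np.random.default_rng(1)
V = rng.random((12, 8))
W, H = nmf(V, rank=3)

# Toy maps: a random initial saliency map and a single gaze hot spot.
initial = rng.random((4, 6))
gaze = np.zeros((4, 6))
gaze[2, 3] = 1.0
fused = fuse(initial, gaze, alpha=0.5)
```

In this sketch `alpha` trades off the model-driven map against the gaze-driven map; the abstract does not say how the actual method balances them.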

[1]  Antonio Torralba,et al.  Modeling the Shape of the Scene: A Holistic Representation of the Spatial Envelope , 2001, International Journal of Computer Vision.

[2]  Yi Yang,et al.  Weakly Supervised Photo Cropping , 2014, IEEE Transactions on Multimedia.

[3]  Gang Wang,et al.  Recurrent Attentional Networks for Saliency Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Yale Song,et al.  TVSum: Summarizing web videos using titles , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[5]  Xuelong Li,et al.  Nonnegative Discriminant Matrix Factorization , 2017, IEEE Transactions on Circuits and Systems for Video Technology.

[6]  Geoffrey E. Hinton,et al.  ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[7]  Xuelong Li,et al.  Semantic Photo Retargeting Under Noisy Image Labels , 2016, TOMM.

[8]  Yi Yang,et al.  A Probabilistic Associative Model for Segmenting Weakly Supervised Images , 2014, IEEE Transactions on Image Processing.

[9]  O. Sorkine,et al.  Optimized scale-and-stretch for image resizing , 2008, SIGGRAPH 2008.

[10]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[11]  Ali Borji,et al.  Computational Modeling of Top-down Visual Attention in Interactive Environments , 2011, BMVC.

[12]  Naila Murray,et al.  AVA: A large-scale database for aesthetic visual analysis , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[13]  Nuno Vasconcelos,et al.  Spatiotemporal Saliency in Dynamic Scenes , 2010, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[14]  Qiaosong Wang,et al.  GraB: Visual Saliency via Novel Graph Model and Background Priors , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[15]  Ming-Hsuan Yang,et al.  Top-down visual saliency via joint CRF and dictionary learning , 2012, CVPR.

[16]  Cristian Sminchisescu,et al.  Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition , 2012, ECCV.

[17]  Xuelong Li,et al.  Actively Learning Human Gaze Shifting Paths for Semantics-Aware Photo Cropping , 2014, IEEE Transactions on Image Processing.

[18]  Nathalie Guyader,et al.  Parallel implementation of a spatio-temporal visual saliency model , 2010, Journal of Real-Time Image Processing.

[19]  Zhou Wang,et al.  Video saliency incorporating spatiotemporal cues and uncertainty weighting , 2013, ICME.

[20]  Yizhou Yu,et al.  Visual saliency based on multiscale deep features , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[21]  Xiao Liu,et al.  Semi-supervised Node Splitting for Random Forest Construction , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[22]  W. Chu Studying Aesthetics in Photographic Images Using a Computational Approach , 2013 .

[23]  Olga Sorkine-Hornung,et al.  A comparative study of image retargeting , 2010, ACM Trans. Graph..

[24]  Xuelong Li,et al.  Large-Scale Aerial Image Categorization Using a Multitask Topological Codebook , 2016, IEEE Transactions on Cybernetics.

[25]  Ali Borji,et al.  Probabilistic learning of task-specific visual attention , 2012, 2012 IEEE Conference on Computer Vision and Pattern Recognition.

[26]  Rudolf Fleischer,et al.  Distance Approximating Dimension Reduction of Riemannian Manifolds , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[27]  Yue Gao,et al.  Beyond Text QA: Multimedia Answer Generation by Harvesting Web Information , 2013, IEEE Transactions on Multimedia.

[28]  Thomas Mauthner,et al.  Encoding based saliency detection for videos and images , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[29]  Xuelong Li,et al.  Fusion of Multichannel Local and Global Structural Cues for Photo Aesthetics Evaluation , 2014, IEEE Transactions on Image Processing.

[30]  Christof Koch,et al.  A Model of Saliency-Based Visual Attention for Rapid Scene Analysis , 2009 .

[31]  Yi Yang,et al.  Weakly Supervised Human Fixations Prediction , 2016, IEEE Transactions on Cybernetics.

[32]  Christoph Schnörr,et al.  Learning Sparse Representations by Non-Negative Matrix Factorization and Sequential Cone Programming , 2006, J. Mach. Learn. Res..

[33]  Antonio Torralba,et al.  LabelMe: A Database and Web-Based Tool for Image Annotation , 2008, International Journal of Computer Vision.

[34]  Jing Zhao,et al.  Document Clustering Based on Nonnegative Sparse Matrix Factorization , 2005, ICNC.

[35]  Gayoung Lee,et al.  Deep Saliency with Encoded Low Level Distance Map and High Level Features , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[36]  Peyman Milanfar,et al.  Static and space-time visual saliency detection by self-resemblance. , 2009, Journal of vision.

[37]  Xuelong Li,et al.  Image Categorization by Learning a Propagated Graphlet Path , 2016, IEEE Transactions on Neural Networks and Learning Systems.

[38]  Shi-Min Hu,et al.  Global contrast based salient region detection , 2011, CVPR 2011.

[39]  Xiao Liu,et al.  Probabilistic Graphlet Transfer for Photo Cropping , 2013, IEEE Transactions on Image Processing.

[40]  Yue Gao,et al.  Probabilistic Skimlets Fusion for Summarizing Multiple Consumer Landmark Videos , 2015, IEEE Transactions on Multimedia.

[41]  Pietro Perona,et al.  Graph-Based Visual Saliency , 2006, NIPS.

[42]  Ariel Shamir,et al.  Improved seam carving for video retargeting , 2008, SIGGRAPH 2008.

[43]  Pascal Vincent,et al.  Representation Learning: A Review and New Perspectives , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  P. Perona,et al.  Objects predict fixations better than early saliency. , 2008, Journal of vision.

[45]  Yi Yang,et al.  Discovering Discriminative Graphlets for Aerial Image Categories Recognition , 2013, IEEE Transactions on Image Processing.

[46]  Yueting Zhuang,et al.  Saliency Detection within a Deep Convolutional Architecture , 2014, AAAI 2014.

[47]  Qi Tian,et al.  Perception-Guided Multimodal Feature Fusion for Photo Aesthetics Assessment , 2014, ACM Multimedia.

[48]  Junwei Han,et al.  DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection , 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[49]  Liming Zhang,et al.  A Novel Multiresolution Spatiotemporal Saliency Detection Model and Its Applications in Image and Video Compression , 2010, IEEE Transactions on Image Processing.

[50]  Mei Han,et al.  Discontinuous seam-carving for video retargeting , 2010, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[51]  Pascal Fua,et al.  SLIC Superpixels Compared to State-of-the-Art Superpixel Methods , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[52]  Xuelong Li,et al.  Detecting Densely Distributed Graph Patterns for Fine-Grained Image Categorization , 2016, IEEE Transactions on Image Processing.

[53]  Hujun Bao,et al.  Non-negative local coordinate factorization for image representation , 2011, CVPR.

[54]  Vicente Ordonez,et al.  High level describable attributes for predicting aesthetics and interestingness , 2011, CVPR 2011.

[55]  Bin Zhao,et al.  Visual Saliency Models Based on Spectrum Processing , 2015, 2015 IEEE Winter Conference on Applications of Computer Vision.

[56]  Yihong Gong,et al.  Nonlinear Learning using Local Coordinate Coding , 2009, NIPS.

[57]  Yong Xu,et al.  Characteristic Gene Selection Based on Robust Graph Regularized Non-Negative Matrix Factorization , 2016, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[58]  Xiao Liu,et al.  Probabilistic Graphlet Cut: Exploiting Spatial Structure Cue for Weakly Supervised Image Segmentation , 2013, 2013 IEEE Conference on Computer Vision and Pattern Recognition.

[59]  Xiaojun Wu,et al.  Graph Regularized Nonnegative Matrix Factorization for Data Representation , 2017, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[60]  Xuelong Li,et al.  Saliency Detection by Multiple-Instance Learning , 2013, IEEE Transactions on Cybernetics.

[61]  Markus A. Stricker,et al.  Similarity of color images , 1995, Electronic Imaging.

[62]  Pierre Baldi,et al.  Bayesian surprise attracts human attention , 2005, Vision Research.

[63]  Christof Koch,et al.  Image Signature: Highlighting Sparse Salient Regions , 2012, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[64]  Hong Qiao,et al.  Learning an Intrinsic-Variable Preserving Manifold for Dynamic Visual Tracking , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[65]  Ling Shao,et al.  Learning Discriminative Key Poses for Action Recognition , 2013, IEEE Transactions on Cybernetics.

[66]  James M. Rehg,et al.  The Secrets of Salient Object Segmentation , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[67]  Harish Katti,et al.  An Eye Fixation Database for Saliency Detection in Images , 2010, ECCV.

[68]  Katerina Pastra,et al.  COSMOROE: a cross-media relations framework for modelling multimedia dialectics , 2008, Multimedia Systems.

[69]  Xuelong Li,et al.  Visual-Context Boosting for Eye Detection , 2010, IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics).

[70]  Jian Shi,et al.  Image Retargeting Using Mesh Parametrization , 2009, IEEE Transactions on Multimedia.

[71]  Junchi Yan,et al.  Visual Saliency Detection via Sparsity Pursuit , 2010, IEEE Signal Processing Letters.

[72]  John K. Tsotsos,et al.  Saliency, attention, and visual search: an information theoretic approach. , 2009, Journal of vision.

[73]  Lihi Zelnik-Manor,et al.  Context-Aware Saliency Detection , 2012, IEEE Trans. Pattern Anal. Mach. Intell..

[74]  Pingkun Yan,et al.  Learning Saliency by MRF and Differential Threshold , 2013, IEEE Transactions on Cybernetics.

[75]  Xiaogang Wang,et al.  Saliency detection by multi-context deep learning , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[76]  S. Palmer Hierarchical structure in perceptual representation , 1977, Cognitive Psychology.

[77]  Nicu Sebe,et al.  Collaborative Sparse Coding for Multiview Action Recognition , 2016, IEEE MultiMedia.

[78]  Wen Gao,et al.  Probabilistic Multi-Task Learning for Visual Saliency Estimation in Video , 2010, International Journal of Computer Vision.

[79]  Stan Z. Li,et al.  Learning spatially localized, parts-based representation , 2001, Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001.

[80]  Yan Ke,et al.  The Design of High-Level Features for Photo Quality Assessment , 2006, 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06).

[81]  Patrik O. Hoyer,et al.  Non-negative sparse coding , 2002, Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing.

[82]  Yue Gao,et al.  Feature Correlation Hypergraph: Exploiting High-order Potentials for Multimodal Recognition , 2014, IEEE Transactions on Cybernetics.

[83]  Garrison W. Cottrell,et al.  Visual saliency model for robot cameras , 2008, 2008 IEEE International Conference on Robotics and Automation.

[84]  Mubarak Shah,et al.  Visual attention detection in video sequences using spatiotemporal cues , 2006, MM '06.

[85]  Alan C. Bovik,et al.  A Statistical Evaluation of Recent Full Reference Image Quality Assessment Algorithms , 2006, IEEE Transactions on Image Processing.

[86]  Chun Chen,et al.  Feature selection for fast speech emotion recognition , 2009, ACM Multimedia.

[87]  Frédo Durand,et al.  Learning to predict where humans look , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[88]  Xuelong Li,et al.  Rank Preserving Sparse Learning for Kinect Based Scene Classification , 2013, IEEE Transactions on Cybernetics.

[89]  Ariel Shamir,et al.  Seam Carving for Content-Aware Image Resizing , 2007, ACM Trans. Graph..

[90]  Ling Shao,et al.  Perceptually Guided Photo Retargeting , 2017, IEEE Transactions on Cybernetics.

[91]  Chun Chen,et al.  Active Learning Based on Locally Linear Reconstruction , 2011, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[92]  Yue Gao,et al.  Representative Discovery of Structure Cues for Weakly-Supervised Image Segmentation , 2014, IEEE Transactions on Multimedia.

[93]  Wei Luo,et al.  Content-Based Photo Quality Assessment , 2013, IEEE Trans. Multim..

[94]  Ali Borji,et al.  An Object-Based Bayesian Framework for Top-Down Visual Attention , 2012, AAAI.

[95]  Luc Van Gool,et al.  Creating Summaries from User Videos , 2014, ECCV.

[96]  Ling Shao,et al.  Spatio-Temporal Laplacian Pyramid Coding for Action Recognition , 2014, IEEE Transactions on Cybernetics.