Deep Saliency: Prediction of Interestingness in Video with CNN

Deep Neural Networks have become the leading approach to the indexing of visual information. They have enabled better performance in the fundamental tasks of visual information indexing and retrieval, such as image classification and object recognition. In fine-grain indexing tasks, namely object recognition in visual scenes, CNN classifiers have to evaluate multiple "object proposals", that is, windows of different sizes and locations in the image plane. The recognition problem is therefore coupled with the localization problem. In this chapter a model for the prediction of Areas-of-Interest in video on the basis of Deep CNNs is proposed. A Deep CNN architecture is designed to classify windows into salient and non-salient, and dense saliency maps are then built from the classification scores. Exploiting the known sensitivity of the human visual system (HVS) to residual motion, the usual primary features such as pixel colour values are complemented with residual motion features. The experiments show that the choice of input features for the Deep CNN depends on the visual task: for interest in dynamic content, the proposed model with residual motion is more efficient.
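To make the two-stage pipeline concrete, the sketch below illustrates it in PyTorch: a small binary CNN scores fixed-size windows as salient or non-salient, and the per-window scores are averaged into a dense saliency map. The layer sizes, patch size, and stride here are illustrative assumptions, not the chapter's exact architecture.

```python
# A minimal sketch of the window-classification idea from the abstract:
# score windows with a binary CNN, then accumulate scores into a dense map.
# Architecture and hyperparameters are illustrative, not the authors' exact model.
import torch
import torch.nn as nn

class WindowClassifier(nn.Module):
    """Binary salient/non-salient classifier for small video patches."""
    def __init__(self, in_channels: int = 3, patch: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=5, padding=2),
            nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(), nn.MaxPool2d(2),
        )
        self.head = nn.Linear(32 * (patch // 4) ** 2, 2)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def dense_saliency_map(frame: torch.Tensor, model: WindowClassifier,
                       patch: int = 32, stride: int = 8) -> torch.Tensor:
    """Slide windows over a (C, H, W) frame; average per-window scores."""
    _, H, W = frame.shape
    scores = torch.zeros(H, W)
    counts = torch.zeros(H, W)
    model.eval()
    with torch.no_grad():
        for y in range(0, H - patch + 1, stride):
            for x in range(0, W - patch + 1, stride):
                window = frame[:, y:y + patch, x:x + patch].unsqueeze(0)
                # Probability of the "salient" class for this window.
                p = torch.softmax(model(window), dim=1)[0, 1]
                scores[y:y + patch, x:x + patch] += p
                counts[y:y + patch, x:x + patch] += 1
    return scores / counts.clamp(min=1)  # dense saliency map in [0, 1]
```

When residual motion is added to the primary features, the same classifier can simply be instantiated with extra input channels, e.g. `WindowClassifier(in_channels=5)` for RGB plus the two residual-flow components.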
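The residual motion feature itself can be approximated as dense optical flow minus a parametric fit of the camera-induced (global) motion. The sketch below assumes OpenCV's Farneback flow and a least-squares affine global-motion model; the chapter's own motion estimator may differ.

```python
# A sketch of a residual-motion feature: dense optical flow minus an
# affine least-squares fit approximating camera motion. The estimator
# choice (Farneback + affine model) is an assumption for illustration.
import cv2
import numpy as np

def residual_motion(prev_gray: np.ndarray, curr_gray: np.ndarray) -> np.ndarray:
    """Per-pixel magnitude of motion remaining after global-motion compensation."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Affine global-motion model: [u, v] = [x, y, 1] @ A, with A of shape (3, 2).
    X = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=1)
    U = flow.reshape(-1, 2)
    A, *_ = np.linalg.lstsq(X, U, rcond=None)
    global_flow = (X @ A).reshape(h, w, 2)   # camera-induced motion field
    residual = flow - global_flow            # object (residual) motion
    return np.linalg.norm(residual, axis=2)  # magnitude map, shape (H, W)
```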
