Deep Learning for Saliency Prediction in Natural Video

The purpose of this paper is the detection of salient areas in natural video using deep learning techniques. Salient patches in video frames are predicted first; the visual fixation maps are then built upon these predictions. We design the deep architecture on the basis of CaffeNet, implemented with the Caffe toolkit. We show that by changing the way training data are selected for the optimisation of network parameters, the computational cost can be reduced by a factor of up to $12$. We extend deep learning approaches for saliency prediction in still images, which operate on RGB values, to the specificities of video by exploiting the sensitivity of the human visual system to residual motion. Furthermore, we complement the primary colour pixel values with contrast features proposed in classical visual attention prediction models. Experiments are conducted on two publicly available datasets. The first is the IRCCYN video database, containing $31$ videos with an overall amount of $7300$ frames and eye fixations of $37$ subjects. The second is HOLLYWOOD2, providing $2517$ movie clips with eye fixations of $19$ subjects. On the IRCCYN dataset, an accuracy of $89.51\%$ is obtained. On the HOLLYWOOD2 dataset, saliency prediction for patches improves by up to $2\%$ compared with using RGB values only, yielding an accuracy of $76.6\%$. The AUC metric, comparing predicted saliency maps with visual fixation maps, shows an increase of up to $16\%$ on a sample of video clips from this dataset.
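To make the two video-specific ingredients concrete, the Python sketch below shows (i) how a residual-motion channel can be appended to the RGB values of a patch before it is fed to the network, and (ii) how the AUC metric can compare a predicted saliency map with a visual fixation map. This is a minimal illustration under stated assumptions, not the authors' implementation: the Farneback optical flow, the median-flow proxy for global (camera) motion, the four-channel patch layout, the patch size and all function names are illustrative choices.

    # Minimal sketch: residual-motion input channel + AUC evaluation.
    import cv2
    import numpy as np
    from sklearn.metrics import roc_auc_score

    def residual_motion_channel(prev_gray, curr_gray):
        """Dense optical-flow magnitude after removing a crude global-motion proxy."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Assumption: the per-component median of the flow field stands in for
        # the global (camera) motion; a full global-motion model would be
        # estimated here in a faithful reimplementation.
        residual = flow - np.median(flow.reshape(-1, 2), axis=0)
        return np.linalg.norm(residual, axis=2)  # H x W magnitude map

    def make_patch_input(rgb_frame, prev_gray, curr_gray, top_left, size=64):
        """Stack RGB and residual motion into a 4-channel patch (size is illustrative)."""
        y, x = top_left
        motion = residual_motion_channel(prev_gray, curr_gray)
        patch = np.dstack([rgb_frame, motion])  # H x W x 4
        return patch[y:y + size, x:x + size, :]

    def saliency_auc(pred_map, fixation_map, threshold=0.5):
        """AUC of a predicted saliency map against binarised ground-truth fixations."""
        labels = (fixation_map.ravel() >= threshold).astype(int)
        return roc_auc_score(labels, pred_map.ravel())

In the setting described above, a CaffeNet-style network would consume such four-channel patches and predict their saliency; the per-patch predictions are then assembled into a frame-level saliency map before the AUC comparison with the eye-fixation data.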
