Prediction of visual saliency in video with deep CNNs

Prediction of visual saliency in images and video is a highly researched topic. Target applications include quality assessment of multimedia services in a mobile context, video compression techniques, recognition of objects in video streams, etc. In mobile and egocentric perspectives, visual saliency models cannot be founded only on bottom-up features, as suggested by feature integration theory; the central-bias hypothesis does not hold either. In this case, the top-down component of human visual attention becomes prevalent, and visual saliency can be predicted on the basis of previously seen data. Deep Convolutional Neural Networks (CNNs) have proven to be a powerful tool for the prediction of salient areas in still images. In our work we also focus on the sensitivity of the human visual system to residual motion in a video. A deep CNN architecture is designed in which we incorporate input primary maps as colour values of pixels and the magnitude of local residual motion. Complementary contrast maps allow for a slight increase in accuracy compared to using colour and residual motion only. The experiments show that the choice of input features for the deep CNN depends on the visual task: for interest in dynamic content, the 4K model with residual motion is more efficient, while for object recognition in egocentric video a purely spatial input is more appropriate.
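As a rough illustration of how such input primary maps might be assembled (a minimal sketch, not the authors' exact pipeline), the code below stacks RGB pixel values with a residual-motion magnitude channel. The Farneback optical flow and the median-flow approximation of global (camera) motion are assumptions made here for illustration; the paper's actual motion-compensation scheme may differ.

```python
import cv2
import numpy as np

def residual_motion_magnitude(prev_bgr, curr_bgr):
    """Magnitude of local motion after removing a crude global-motion estimate.

    Dense Farneback optical flow approximates apparent motion; the per-frame
    median flow vector stands in for camera (global) motion. This median
    subtraction is a simplification of the parametric global-motion models
    typically used in the literature.
    """
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2GRAY)
    # pyr_scale=0.5, levels=3, winsize=15, iterations=3, poly_n=5, poly_sigma=1.2
    flow = cv2.calcOpticalFlowFarneback(prev_gray, curr_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    global_motion = np.median(flow.reshape(-1, 2), axis=0)  # crude camera motion
    residual = flow - global_motion                          # local (residual) motion
    return np.linalg.norm(residual, axis=2)                  # H x W magnitude map

def build_input_maps(prev_bgr, curr_bgr):
    """Stack colour channels and residual-motion magnitude into one CNN input."""
    mag = residual_motion_magnitude(prev_bgr, curr_bgr)
    rgb = cv2.cvtColor(curr_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    mag = mag / (mag.max() + 1e-8)            # normalise to [0, 1]
    return np.dstack([rgb, mag])              # H x W x 4 primary input maps
```

In practice, such an H x W x 4 tensor would be cropped into patches centred on candidate pixels and fed to the CNN; a complementary contrast map could be stacked as an additional channel in the same way.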
