Modelling Spatio-Temporal Saliency to Predict Gaze Direction for Short Videos

This paper presents a spatio-temporal saliency model that predicts eye movements during free viewing of videos. The model is inspired by the biology of the first stages of the human visual system: it extracts from the video stream two signals corresponding to the two main outputs of the retina, the parvocellular and the magnocellular pathways. Both signals are then split into elementary feature maps by cortical-like filters, and these feature maps are used to build two saliency maps, one static and one dynamic, which are finally fused into a single spatio-temporal saliency map. The model is evaluated by comparing, for each frame, the salient areas predicted by the spatio-temporal saliency map with the eye positions recorded from different subjects during a free-viewing experiment on a large database (17,000 frames). In parallel, the static and dynamic pathways are analyzed to understand which features are more or less salient and for which types of videos the model is a good or a poor predictor of eye movements.
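To make the pipeline concrete, below is a minimal Python sketch of the stages described above: a static pathway, a dynamic pathway, their fusion into a spatio-temporal map, and a fixation-based evaluation. Everything here is an illustrative assumption rather than the authors' implementation: the oriented-gradient filter bank stands in for the retinal and cortical-like filters, frame differencing stands in for the model's motion estimation, and the fusion rule and the Normalized Scanpath Saliency (NSS) score are plausible choices for this kind of comparison against recorded eye positions.

# Minimal sketch, NOT the authors' implementation; all parameters,
# filters, and the fusion rule are illustrative assumptions.
import numpy as np
from scipy import ndimage

def static_saliency(frame, sigmas=(1.0, 2.0, 4.0), n_orient=4):
    """Static pathway: sum the energy of oriented band-pass responses.
    Smoothed directional gradients approximate a Gabor-like filter bank."""
    sal = np.zeros(frame.shape, dtype=float)
    for sigma in sigmas:
        smooth = ndimage.gaussian_filter(frame.astype(float), sigma)
        gy = ndimage.sobel(smooth, axis=0)
        gx = ndimage.sobel(smooth, axis=1)
        for k in range(n_orient):
            theta = k * np.pi / n_orient
            sal += np.abs(np.cos(theta) * gx + np.sin(theta) * gy)
    return sal / (sal.max() + 1e-8)

def dynamic_saliency(prev_frame, curr_frame):
    """Dynamic pathway: frame differencing as a crude stand-in for the
    camera-motion-compensated motion estimation used in the paper."""
    motion = np.abs(curr_frame.astype(float) - prev_frame.astype(float))
    return motion / (motion.max() + 1e-8)

def fuse(s_static, s_dynamic):
    """Fuse the two maps into a spatio-temporal saliency map; a weighted
    sum with a multiplicative reinforcement term (one plausible choice)."""
    return s_static + s_dynamic + s_static * s_dynamic

def nss(saliency, fixations):
    """Normalized Scanpath Saliency: mean z-scored saliency at the
    recorded eye positions, given as (row, col) pairs."""
    z = (saliency - saliency.mean()) / (saliency.std() + 1e-8)
    rows, cols = zip(*fixations)
    return z[list(rows), list(cols)].mean()

On a real sequence one would compute fuse(static_saliency(f), dynamic_saliency(prev, f)) for every frame and average the NSS over all frames and subjects, which mirrors the frame-by-frame evaluation described above.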
