Spatial and temporal visual attention prediction in videos using eye movement data

Visual attention detection in static images has made outstanding progress in recent years, whereas much less effort has been devoted to learning visual attention in video sequences. In this paper, we propose a novel method to model spatial and temporal visual attention in videos by learning from human gaze data. Spatial visual attention predicts where viewers look within each video frame, while temporal visual attention measures which frames are more likely to attract viewers' interest. Our underlying premise is that objects and their movements, rather than conventional contrast-related information, are the major factors driving visual attention in dynamic scenes. First, the proposed models extract two types of bottom-up features, derived from multi-scale object filter responses and from spatiotemporal motion energy, respectively. Then, spatiotemporal gaze density and inter-observer gaze congruency are computed from a large collection of human eye-gaze data to form two training sets. Finally, prediction models for spatial and temporal visual attention are learned from these training sets together with the bottom-up features. Extensive evaluations on publicly available video benchmarks, and an application to interestingness prediction for movie trailers, demonstrate the effectiveness of the proposed approach.
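The temporal branch of this pipeline can be sketched in a few lines. The snippet below is a minimal illustration under loose assumptions, not the paper's implementation: spatiotemporal motion energy is approximated by per-frame temporal gradient magnitude, and the temporal-attention model by an ordinary least-squares fit mapping frame features to an inter-observer gaze congruency score; all function names are hypothetical.

```python
import numpy as np

def motion_energy(frames):
    """Crude motion-energy proxy: mean absolute temporal gradient.

    frames: (T, H, W) grayscale video array.
    Returns one scalar feature per frame transition, shape (T - 1,).
    (The paper uses oriented spatiotemporal filter banks; this is a
    simplified stand-in.)
    """
    dt = np.abs(np.diff(frames.astype(float), axis=0))
    return dt.mean(axis=(1, 2))

def fit_temporal_attention(features, gaze_congruency):
    """Least-squares linear model from frame features to the
    inter-observer gaze congruency target (with a bias term)."""
    X = np.column_stack([features, np.ones(len(features))])
    w, *_ = np.linalg.lstsq(X, gaze_congruency, rcond=None)
    return w

def predict_temporal_attention(w, features):
    """Score each frame transition with the learned linear model."""
    X = np.column_stack([features, np.ones(len(features))])
    return X @ w
```

In practice the scalar feature would be replaced by the multi-dimensional bottom-up feature vector, and the least-squares fit by a regularized learner, but the train-then-score structure is the same.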
