Fixation prediction through multimodal analysis

In this paper, we propose to predict human fixations by incorporating both audio and visual cues. Traditional visual attention models generally make full use of a stimulus's visual features while discarding all audio information. In the real world, however, humans not only direct their gaze according to visual saliency but may also be attracted by salient sounds. Psychological experiments show that audio influences visual attention, and subjects tend to be attracted to sound sources. We therefore propose to fuse audio and visual information to predict fixations. In our framework, we first localize moving-sounding objects through multimodal analysis and generate an audio attention map, in which a larger value denotes a higher probability that a position is the sound source. We then compute spatial and temporal attention maps using the visual modality alone. Finally, the audio, spatial, and temporal attention maps are fused into our final audio-visual saliency map. We gather a set of videos and collect eye-tracking data under audio-visual test conditions. Experimental results show that considering both audio and visual cues yields better performance.
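The abstract does not specify how the three attention maps are combined. As an illustrative sketch only (the actual fusion rule and weights are not given here), a simple weighted linear combination of per-map normalized attention maps could look like:

```python
import numpy as np

def normalize(m):
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    lo, hi = m.min(), m.max()
    return (m - lo) / (hi - lo) if hi > lo else np.zeros_like(m)

def fuse_saliency(audio, spatial, temporal, weights=(1/3, 1/3, 1/3)):
    """Hypothetical fusion: a weighted sum of the normalized audio,
    spatial, and temporal attention maps, renormalized to [0, 1].
    The equal weights here are an assumption, not the paper's choice."""
    wa, ws, wt = weights
    fused = (wa * normalize(audio)
             + ws * normalize(spatial)
             + wt * normalize(temporal))
    return normalize(fused)
```

In practice such weights would be tuned (or learned) against eye-tracking data; the sketch only shows the shape of the fusion step, not the method proposed in the paper.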
