Audio Matters in Visual Attention

Little is known about how perceived auditory information guides image-viewing behavior. To investigate auditory-driven visual attention, we first built a human eye-fixation database from a pool of 200 static images and 400 image-audio pairs viewed by 48 subjects. For the image-audio pairs, eye-tracking data were recorded while participants viewed each image immediately after exposure to a coherent or incoherent audio sample. The database was analyzed in terms of time to first fixation, fixation duration on the target object, entropy, area under the curve (AUC), and saliency ratio. We found that coherent audio information is an important cue that enhances the feature-specific response to the target object, whereas incoherent audio information attenuates this response. Finally, we developed a system that predicts image-viewing behavior under the influence of different audio sources. The top-down module of the system, which is discussed in detail, combines auditory estimation based on a Gaussian mixture model, maximum a posteriori, universal background model (GMM-MAP-UBM) structure with visual estimation based on a conditional random field (CRF) model and sparse latent variables. Evaluation experiments show that the predictions of the proposed models are strongly consistent with human eye fixations.
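
Two of the analysis measures named above, entropy and AUC, can be illustrated with a minimal sketch. The abstract does not give the exact definitions used, so the code below assumes common ones: Shannon entropy of a normalized fixation density map, and AUC computed as the probability that a fixated pixel has a higher saliency value than a non-fixated pixel. The function names `fixation_entropy` and `saliency_auc` and the toy arrays are hypothetical and serve only as illustration.

```python
import numpy as np


def fixation_entropy(density):
    """Shannon entropy in bits of a fixation density map (assumed definition).

    Lower entropy means fixations are concentrated on fewer locations.
    """
    p = np.asarray(density, dtype=float).ravel()
    p = p / p.sum()          # normalize to a probability distribution
    p = p[p > 0]             # drop empty cells, since 0 * log(0) is taken as 0
    return float(-(p * np.log2(p)).sum())


def saliency_auc(saliency, fixation_mask):
    """AUC of a saliency map against a binary fixation mask (assumed definition).

    Uses the rank interpretation of AUC: the fraction of
    (fixated, non-fixated) pixel pairs in which the fixated pixel has the
    higher saliency value, with ties counted as one half.
    """
    s = np.asarray(saliency, dtype=float).ravel()
    m = np.asarray(fixation_mask, dtype=bool).ravel()
    pos, neg = s[m], s[~m]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (pos.size * neg.size))


# Toy example: a 2x2 saliency map with two fixated pixels.
toy_saliency = np.array([[0.9, 0.1],
                         [0.2, 0.8]])
toy_fixations = np.array([[1, 0],
                          [0, 1]])
toy_auc = saliency_auc(toy_saliency, toy_fixations)
toy_entropy = fixation_entropy(np.ones((4, 4)))
```

Under these assumed definitions, an AUC near 1 means the predicted saliency ranks fixated locations above non-fixated ones, and an AUC near 0.5 means it does no better than chance. The pairwise comparison keeps the sketch short; for full-resolution images with many fixations, a rank-based computation would use less memory.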
