Audio-visual saliency map: Overview, basic models and hardware implementation

In this paper we provide an overview of audiovisual saliency map models. In the simplest model, the location of the auditory source is modeled as a Gaussian, and several methods of combining the auditory and visual saliency information are compared. We then present experimental results from applying simple audio-visual integration models to cognitive scene analysis. We validate these simple audio-visual saliency models with a hardware convolutional network architecture and real data recorded from moving audio-visual objects. The latter system was developed in the Torch language by extending the attention.lua (code) and attention.ui (GUI) files that implement Culurciello's visual attention model.
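As a minimal sketch of the simplest model described above, the Lua/Torch snippet below builds a Gaussian auditory saliency map around an estimated source location and fuses it with a visual saliency map. This is an illustrative assumption of the scheme, not the authors' attention.lua code; the map size, source location, sigma, and fusion weights are all hypothetical placeholders.

```lua
-- Sketch (assumed): Gaussian auditory saliency map fused with a visual map.
require 'torch'

local H, W = 120, 160          -- map resolution (illustrative)
local mu_x, mu_y = 100, 60     -- estimated auditory source location (illustrative)
local sigma = 15               -- spatial uncertainty of the auditory estimate

-- Auditory saliency map: 2D Gaussian centered on the estimated source
local A = torch.zeros(H, W)
for y = 1, H do
  for x = 1, W do
    local d2 = (x - mu_x)^2 + (y - mu_y)^2
    A[y][x] = math.exp(-d2 / (2 * sigma * sigma))
  end
end

-- Visual saliency map V would come from the visual attention model;
-- here it is only a random placeholder in [0, 1].
local V = torch.rand(H, W)

-- Two simple ways of combining the auditory and visual information:
local S_add = V * 0.5 + A * 0.5   -- weighted linear combination
local S_mul = torch.cmul(V, A)    -- multiplicative (coincidence) fusion
```

The weighted sum keeps visually salient locations active even when the auditory estimate is uncertain, while the multiplicative scheme emphasizes only locations where both modalities agree.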
