Multimodal Saliency Models for Videos

The modeling of visual attention has been the subject of numerous studies across many research fields, from neuroscience to computer vision. This interdisciplinary interest has led to the publication of a large number of computational models of attention, which emphasize salient regions in visual scenes. The current chapter provides an overview of the evolution of cognitive visual saliency models, from early low-level models for static images to models for dynamic scenes that use more complex features such as text or faces. We focus on saliency models that combine multimodal features (especially audio and visual) into a single master saliency map: when computing the saliency of a video, these models jointly use auditory features extracted from the soundtrack and visual features extracted from the frames. This is illustrated with a detailed description of an audiovisual saliency model for videos of conversations. The model includes a speaker diarization algorithm, which automatically modulates the saliency of conversation partners according to whether or not they are speaking. Finally, the chapter closes with some ideas for extending audiovisual saliency modeling to more general scenes featuring varied content.
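The fusion scheme described above can be illustrated with a minimal sketch. This is not the chapter's actual algorithm: the weighting function, the `w_speaker`/`w_listener` parameters, and the additive fusion of a low-level map with face masks are all assumptions made for illustration, showing only the general idea of boosting the saliency of the current speaker's face region.

```python
import numpy as np

def master_saliency(low_level, face_masks, speaking,
                    w_speaker=2.0, w_listener=0.5):
    """Combine a low-level saliency map with face regions weighted by
    speaking status (hypothetical weights, for illustration only).

    low_level  -- 2D array, bottom-up visual saliency map
    face_masks -- list of 2D binary arrays, one per conversation partner
    speaking   -- list of booleans from a diarization step (True = speaking)
    """
    saliency = low_level.astype(float).copy()
    for mask, is_speaking in zip(face_masks, speaking):
        # Boost the speaker's face more strongly than the listeners' faces.
        weight = w_speaker if is_speaking else w_listener
        saliency += weight * mask
    total = saliency.sum()
    if total > 0:
        saliency /= total  # normalize to a probability-like master map
    return saliency
```

With this toy weighting, the face region of the detected speaker receives a higher value in the master map than the faces of silent partners, mimicking the gaze bias toward speakers reported in conversation-viewing studies.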
