How saliency, faces, and sound influence gaze in dynamic social scenes.

Conversation scenes are a typical example in which classical models of visual attention dramatically fail to predict eye positions. These models rarely treat faces as particular gaze attractors and never take into account the auditory information that always accompanies dynamic social scenes. We recorded the eye movements of participants viewing dynamic conversations taking place in various contexts. Conversations were seen either with their original soundtracks or with unrelated soundtracks (unrelated speech and abrupt or continuous natural sounds). First, we analyze how the auditory condition influences participants' eye movement parameters. Then, we model the probability distribution of eye positions across each video frame with a statistical method (Expectation-Maximization), allowing us to quantify the relative contributions of different visual features: static low-level visual saliency (based on luminance contrast), dynamic low-level visual saliency (based on motion amplitude), faces, and center bias. Both the experimental and the modeling results show that, regardless of the auditory condition, participants look more at faces, and especially at talking faces. Hearing the original soundtrack makes participants follow the speech turn-taking more closely, whereas we find no difference between the different types of unrelated soundtracks. The eye-tracking results are confirmed by our model, which shows that faces, and particularly talking faces, are the features that best explain the recorded gaze positions, especially in the original soundtrack condition. Low-level saliency is not a relevant feature for explaining eye positions in social scenes, even dynamic ones. Finally, we propose groundwork for an audiovisual saliency model.
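The abstract does not spell out the mixture model, but the Expectation-Maximization scheme it describes amounts to estimating, per frame, one weight per feature map. A minimal sketch, assuming each feature map (static saliency, dynamic saliency, faces, center bias) has been normalized into a probability distribution over pixels, could look like the following; the function name `em_feature_weights` and its parameters are illustrative, not taken from the paper:

```python
import numpy as np

def em_feature_weights(feature_maps, fixations, n_iter=100, tol=1e-6):
    """Estimate mixture weights for fixed feature maps from fixations via EM.

    feature_maps : list of K 2-D arrays (same shape), each normalized to
                   sum to 1 (e.g., static saliency, dynamic saliency,
                   face map, center bias).
    fixations    : (N, 2) integer array of (row, col) eye positions.
    Returns a length-K weight vector summing to 1.
    """
    K = len(feature_maps)
    rows, cols = fixations[:, 0], fixations[:, 1]
    # Likelihood of each fixation under each (fixed) feature map: (N, K).
    lik = np.column_stack([fm[rows, cols] for fm in feature_maps]) + 1e-12
    w = np.full(K, 1.0 / K)  # start from uniform weights
    for _ in range(n_iter):
        # E-step: responsibility of feature map k for fixation i.
        resp = w * lik
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: new weights are the mean responsibilities.
        w_new = resp.mean(axis=0)
        if np.abs(w_new - w).max() < tol:
            return w_new
        w = w_new
    return w
```

Under these assumptions, the fitted weights can then be compared across auditory conditions to judge which feature (e.g., the face map versus low-level saliency) best explains the recorded fixations on each frame.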
