Integrating mechanisms of visual guidance in naturalistic language production

Abstract

Situated language production requires the integration of visual attention and linguistic processing. Previous work has not conclusively disentangled the roles of perceptual scene information and structural sentence information in guiding visual attention. In this paper, we present an eye-tracking study demonstrating that three types of guidance (perceptual, conceptual, and structural) interact to control visual attention. In a cued language production experiment, we manipulate perceptual guidance (scene clutter) and conceptual guidance (cue animacy) and measure structural guidance (the syntactic complexity of the utterance). Analysis of the time course of language production, before and during speech, reveals that all three forms of guidance affect the complexity of visual responses, quantified in terms of the entropy of attentional landscapes and the turbulence of scan patterns, especially during speech. We find that perceptual and conceptual guidance mediate the distribution of attention in the scene, whereas structural guidance closely relates to scan-pattern complexity. Furthermore, the eye–voice spans of the cued object and its perceptual competitor are similar, and their latencies are mediated by both perceptual and structural guidance. These results rule out a strict interpretation of structural guidance as the single dominant form of visual guidance in situated language production. Rather, the phase of the task and the associated demands of cross-modal cognitive processing determine the mechanisms that guide attention.
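To make the two complexity measures concrete, the sketch below (in Python; not the authors' implementation) shows how each can be approximated. The entropy of an attentional landscape is taken here as the Shannon entropy of the distribution of fixations over labeled scene regions, and scan-pattern turbulence is illustrated by one of its key ingredients, the number of distinct subsequences of the fixation sequence, following Elzinga's sequence-turbulence measure as computed by TraMineR's seqST. All region labels and example sequences are hypothetical.

    # A minimal sketch, not the authors' code: two indices of visual-response
    # complexity over a sequence of fixated scene regions.
    from collections import Counter
    from math import log2

    def landscape_entropy(fixations):
        """Shannon entropy (bits) of the fixation distribution over regions.

        Low entropy: attention concentrated on a few regions.
        High entropy: attention spread evenly across the scene.
        """
        counts = Counter(fixations)
        total = sum(counts.values())
        return -sum((c / total) * log2(c / total) for c in counts.values())

    def distinct_subsequences(fixations):
        """Number of distinct (not necessarily contiguous) subsequences,
        the phi(x) term in Elzinga's sequence turbulence. More alternation
        between regions yields more distinct subsequences, i.e. a more
        turbulent scan pattern.
        """
        total = 1  # the empty subsequence
        last = {}  # count before the previous occurrence of each region
        for region in fixations:
            prev = total
            total = 2 * total - last.get(region, 0)
            last[region] = prev
        return total

    # Hypothetical scan patterns for illustration.
    focused = ["man", "man", "man", "clipboard", "man", "man"]
    turbulent = ["man", "shelf", "clipboard", "man", "door", "shelf"]
    print(landscape_entropy(focused), landscape_entropy(turbulent))      # ~0.65 vs. ~1.92 bits
    print(distinct_subsequences(focused), distinct_subsequences(turbulent))  # 18 vs. 58

The full turbulence measure additionally weights the subsequence count by the variance of state durations; the sketch captures only the subsequence-count component.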
