Modeling Utterance-mediated Attention in Situated Language Comprehension

Ján Svantner (svantner@fmph.uniba.sk), Igor Farkaš (farkas@fmph.uniba.sk)
Department of Applied Informatics, Comenius University, Mlynská dolina, 824 48 Bratislava, Slovakia

Matthew Crocker (crocker@coli.uni-sb.de)
Department of Computational Linguistics and Phonetics, Saarland University, 66123 Saarbrücken, Germany

Abstract

Empirical evidence from studies using the visual world paradigm reveals that spoken language guides attention in a related visual scene and that scene information can influence the comprehension process. Here we model sentence comprehension using the visual context. A recurrent neural network is trained to associate the linguistic input with the visual scene and to produce the interpretation of the described event. A feedback mechanism in the form of sigma-pi connections is added to model the explicit utterance-mediated visual attention behavior revealed by the visual world paradigm. The results show that the network successfully learns the sentence-final interpretation and also demonstrates the hallmark anticipatory behavior of predicting upcoming constituents.

Keywords: connectionist modeling; sentence comprehension; attentional mechanism; visual scene

Introduction

During the last decade, research in human language comprehension has progressed well beyond the examination of the syntactic and semantic properties of words and sentences considered in isolation. Detailed on-line evidence for how people comprehend visually situated language has come from the visual world paradigm (see Huettig, Rommers, and Meyer (2011) for a recent review). The visual world paradigm takes advantage of the listeners' tendency to look at relevant elements of the visual scene as they are mentioned or anticipated (which is typically measured by eye-tracking). Specifically, it has been shown that spoken language can guide attention in a related visual scene and that scene information can immediately influence the comprehension process (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). Findings have revealed the rapid and incremental influence of visual referential context (Spivey, Tanenhaus, Eberhard, & Sedivy, 2002; Tanenhaus et al., 1995) and depicted events (Knoeferle, Crocker, Scheepers, & Pickering, 2005) on ambiguity resolution in online situated utterance processing. Further research demonstrated that listeners even anticipate likely upcoming role fillers in the scene based on their linguistic and general knowledge (e.g., Kamide, Altmann, and Haywood (2003)). Knoeferle and Crocker (2006) identified several cognitive characteristics based on the above-mentioned findings, claiming that situated language comprehension is incremental, anticipatory, integrative, adaptive, and coordinated, which led to the proposal of the coordinated interplay account (CIA).

The recent CIANet model (Mayberry, Crocker, & Knoeferle, 2009) instantiates the CIA proposal and accounts for a range of empirical findings. CIANet is a recurrent sigma-pi neural network that models the rapid use of scene information, exploiting an utterance-mediated attentional mechanism.
The model was shown to achieve very good performance (both with and without scene contexts), while also exhibiting hallmark behaviors of situated comprehension, such as incremental processing, anticipation of appropriate role fillers, and the immediate use and priority of depicted event information through the coordinated use of utterance-mediated attention to the scene. Several other models that link language with the visual world do exist, including those mentioned in the recent review by Huettig et al. (2011), as well as Yu, Ballard, and Aslin (2005) and Gold and Scassellati (2007). These models emphasize situated lexical learning and processing, however, and there remain very few attempts to model the compositional and incremental nature of visually situated sentence comprehension.

Inspired by the above-mentioned CIANet, we investigate a more general network architecture that also learns to adapt the attention mechanism to help the network focus on (and predict upcoming) relevant constituents, and that in principle allows generalization to more complex scenes (the attention mechanism in CIANet is restricted to favoring one of two concurrent events). Our model also differs from CIANet (and other models) in that inhibition operates at both the object and event levels that are assumed to underlie the cognitive representation of the visual scene, rather than only at the event level. In addition, our work assumes that visually grounded lexical representations are already in place, focusing instead on the compositional aspects of situated sentence comprehension.
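To make the attentional mechanism concrete, the following is a minimal sketch in Python/NumPy of utterance-mediated sigma-pi attention. It is an illustrative reconstruction under our own assumptions, not the implementation evaluated in this paper: a simple recurrent hidden layer, driven word by word by the unfolding utterance, produces attention weights over the depicted events, and these weights multiplicatively gate the scene input that feeds back into the comprehension network. The class name, layer sizes, softmax normalization, and the two-event scene are illustrative choices.

# Hypothetical sketch of utterance-mediated sigma-pi attention (not the
# authors' implementation); all names and sizes are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class SigmaPiAttentionSRN:
    def __init__(self, n_word, n_event, n_hidden, n_events=2, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.1, shape)
        self.W_in = init(n_hidden, n_word)      # word input -> hidden
        self.W_rec = init(n_hidden, n_hidden)   # hidden -> hidden (recurrence)
        self.W_ev = init(n_hidden, n_event)     # attended scene event -> hidden
        self.W_att = init(n_events, n_hidden)   # hidden -> attention logits
        self.h = np.zeros(n_hidden)

    def step(self, word_vec, event_vecs):
        # Utterance-mediated attention: the current interpretation (hidden
        # state) determines how strongly each depicted event is attended.
        att = softmax(self.W_att @ self.h)
        # Sigma-pi combination: attention weights multiplicatively gate the
        # event representations before they feed back into the hidden layer.
        scene = sum(a * e for a, e in zip(att, event_vecs))
        self.h = sigmoid(self.W_in @ word_vec
                         + self.W_rec @ self.h
                         + self.W_ev @ scene)
        return self.h, att

# Toy usage: feed a 5-word utterance word by word over a two-event scene;
# the attention over the depicted events shifts as constituents arrive.
net = SigmaPiAttentionSRN(n_word=20, n_event=12, n_hidden=50)
events = [np.random.rand(12), np.random.rand(12)]
for word_vec in np.eye(20)[:5]:
    h, att = net.step(word_vec, events)
print("attention over events:", att)

In a sketch of this kind, letting the attention weights range over individual objects as well as whole events would correspond to the object-level inhibition discussed above, which is the direction in which our architecture generalizes.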

References

[1] Yu, C., Ballard, D. H., & Aslin, R. N. (2005). The role of embodied intention in early lexical acquisition. Cognitive Science.

[2] Gold, K., & Scassellati, B. (2007). A robot that uses existing vocabulary to infer non-visual word meanings from observation. AAAI.

[3] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation.

[4] Rumelhart, D. E., McClelland, J. L., & the PDP Research Group (1986). Parallel distributed processing: Explorations in the microstructure of cognition, vol. 1: Foundations.

[5] Spivey, M. J., Tanenhaus, M. K., Eberhard, K. M., & Sedivy, J. C. (2002). Eye movements and spoken language comprehension: Effects of visual context on syntactic ambiguity resolution. Cognitive Psychology.

[6] Huettig, F., Rommers, J., & Meyer, A. S. (2011). Using the visual world paradigm to study language processing: A review and critical evaluation. Acta Psychologica.

[7] Mayberry, M. R., Crocker, M. W., & Knoeferle, P. (2009). Learning to attend: A connectionist model of situated language comprehension. Cognitive Science.

[8] Knudsen, E. I. (2007). Fundamental components of attention. Annual Review of Neuroscience.

[9] Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE.

[10] Tanenhaus, M. K., Spivey-Knowlton, M. J., Eberhard, K. M., & Sedivy, J. C. (1995). Integration of visual and linguistic information in spoken language comprehension. Science.

[11] Knoeferle, P., Crocker, M. W., Scheepers, C., & Pickering, M. J. (2005). The influence of the immediate visual context on incremental thematic role-assignment: Evidence from eye-movements in depicted events. Cognition.

[12] Knoeferle, P., & Crocker, M. W. (2006). The coordinated interplay of scene, utterance, and world knowledge: Evidence from eye tracking. Cognitive Science.

[13] Elman, J. L. (1990). Finding structure in time. Cognitive Science.

[14] Kamide, Y., Altmann, G. T. M., & Haywood, S. L. (2003). The time-course of prediction in incremental sentence processing: Evidence from anticipatory eye-movements. Journal of Memory and Language.