A Connectionist Model of the Coordinated Interplay of Scene, Utterance, and World Knowledge

Marshall R. Mayberry, III (martym@coli.uni-sb.de)
Matthew W. Crocker (crocker@coli.uni-sb.de)
Pia Knoeferle (knoeferle@coli.uni-sb.de)
Department of Computational Linguistics, Saarland University, 66041 Saarbrücken, Germany

Abstract

The interaction of utterance comprehension and information from a visual scene is characterized by the closely time-locked coordination of incremental comprehension and attention in the scene. Comprehension is also anticipatory, as revealed by attention to objects in a scene before they are mentioned. The interaction is further marked by the rapid and seamless integration of, and adaptation to, diverse information sources in both the utterance and visual scene. These sources can interact dynamically, both complementarily and, at times, conflictingly. A recurrent sigma-pi neural network is presented that implements an attentional mechanism to model these behaviors, directly instantiating the coordinated interplay account, which holds that the utterance guides attention in the scene, which in turn rapidly provides information that influences comprehension. A key aspect of the account is that the immediacy of depicted events in the scene takes precedence over stereotypical knowledge when these two information sources conflict. Crucially, the model captures this behavior without being explicitly trained to resolve the conflict, even when the relative frequency of the information sources differs greatly.

Keywords: Connectionist modelling; situated utterance comprehension; language-scene interaction; attention

Introduction

All human communication occurs in context. Indeed, even the so-called isolated phrase, coveted by linguists for its self-contained syntactic and semantic properties, is understood only within the context of human experience. In this way, the study of how language relates to its context provides insight into the very nature of language itself: how it means anything at all. Understanding the interaction of language and context, such as a visual environment, serves to identify and delineate the cognitive mechanisms involved in language comprehension, and how resources such as linguistic and world knowledge, as well as information from the visual context, are utilized. This challenge is especially daunting because language is inherently dynamic, and the utilization of these various information sources must be coordinated in real time.

Fortunately, a growing body of psycholinguistic research in the visual worlds experimental paradigm, wherein subjects' eye movements over a visual scene are monitored as they listen to an utterance, has begun to yield tangible data on the nature of the on-line interaction of utterance comprehension and context. Typically, that context is a visual scene that can establish referents and relations, together with the participants' own linguistic and world knowledge. The analysis of eye movements in a scene during utterance comprehension under the controlled manipulation of a variety of information sources has revealed five fundamental characteristics of on-line situated utterance comprehension. First, on-line comprehension occurs incrementally and is closely time-locked with attention to the scene (Tanenhaus, Spivey-Knowlton, Eberhard, & Sedivy, 1995). Second, attention to objects in a scene before they are mentioned in an utterance shows that anticipation plays a vital role in comprehension (Altmann & Kamide, 1999). Third, all available information sources (linguistic and world knowledge, as well as scene information) are rapidly and seamlessly integrated during on-line comprehension (Knoeferle, Crocker, Scheepers, & Pickering, 2005; Kamide, Scheepers, & Altmann, 2003; Sedivy, Tanenhaus, Chambers, & Carlson, 1999; Tanenhaus et al., 1995). Fourth, sentence comprehension is highly adaptive to the dynamic availability of information from these multiple sources. Fifth, these sources of information are coordinated: the interaction between language and visual scene processing is a two-way street. Comprehension of the unfolding utterance rapidly guides attention to objects in the scene and, in turn, the attended region of the scene tightly constrains and influences comprehension, a process Knoeferle and Crocker (in press) dub the coordinated interplay account (CIA). Furthermore, a full account of this interaction must address the issue of what happens when information sources conflict: which sources take precedence and why? Recent research on the interaction between world knowledge and information from a visual scene indicates that immediately depicted events are preferred over knowledge about stereotypical relationships when these conflict. Knoeferle and Crocker suggest that such a preference may have its basis in the role the immediate visual environment plays in child-directed speech during language acquisition (e.g., Snow, 1977).

These characteristics of situated utterance comprehension pose an interesting challenge for modellers. A successful model should operate incrementally, anticipate upcoming referents, rapidly and seamlessly integrate information from multiple sources, adapt to available information, exhibit the observed attentional shift during utterance comprehension, and demonstrate the observed preference for depicted information over world knowledge when these information sources conflict.

Two recently proposed models feature several of these characteristics. The Fuse model by Roy and Mukherjee (2005) uses an attentional mechanism to constrain the number of referents in order to improve speech recognition. The system does predict different ways a person might describe objects in a scene and biases how the words are recognized. The scene employed contains only objects, and is always assumed
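The sigma-pi attentional mechanism described above can be illustrated with a minimal sketch. In a sigma-pi layer, inputs are first multiplied together (pi) and the products are then summed (sigma); here, an utterance-derived attention weight multiplicatively gates each scene-event representation before the summation, so the attended event dominates the signal fed back into comprehension. The vector dimensions, event labels, and function name below are hypothetical, chosen only for illustration; the paper's actual network architecture is specified later.

```python
import numpy as np

def sigma_pi_gate(attention, events):
    """Sigma-pi gating of scene events by utterance-derived attention.

    attention: (n_events,) weights derived from the unfolding utterance
    events:    (n_events, dim) vector representations of depicted events
    Returns the (dim,) attention-weighted blend of event representations.
    """
    # Products first (pi): each event vector scaled by its attention
    # weight; then summation (sigma) across events. Implemented as a
    # vector-matrix product.
    return attention @ events

# Toy example: two depicted events; the unfolding utterance favors event 0.
events = np.array([[1.0, 0.0, 0.5],    # hypothetical event A
                   [0.0, 1.0, 0.5]])   # hypothetical event B
attention = np.array([0.9, 0.1])       # softmax-like weights from utterance
blended = sigma_pi_gate(attention, events)
print(blended)  # attended event dominates: [0.9 0.1 0.5]
```

The multiplicative interaction is the essential point: with an ordinary additive layer both events would contribute fixed weights regardless of the utterance, whereas the sigma-pi product lets the utterance dynamically reweight which depicted event influences comprehension.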
[1] Matthew W. Crocker et al. (2006). The Coordinated Interplay of Scene, Utterance, and World Knowledge: Evidence From Eye Tracking. Cognitive Science.
[2] Deb Roy et al. (2005). Towards situated speech understanding: visual context priming of language models. Computer Speech and Language.
[3] C. A. Ferguson et al. (1979). Talking to Children: Language Input and Acquisition.
[4] Matthew W. Crocker et al. (2005). The influence of the immediate visual context on incremental thematic role-assignment: evidence from eye-movements in depicted events. Cognition.
[5] Roger K. Moore (1986). Computer Speech and Language.
[6] Julie C. Sedivy et al. (1999). Achieving incremental semantic interpretation through contextual representation. Cognition.
[7] G. Altmann et al. (1999). Incremental interpretation at verbs: restricting the domain of subsequent reference. Cognition.
[8] Geoffrey E. Hinton et al. (1986). Learning internal representations by error propagation.
[9] Julie C. Sedivy et al. (1995). Integration of visual and linguistic information in spoken language comprehension. Science.
[10] Jeffrey L. Elman et al. (1990). Finding Structure in Time. Cognitive Science.
[11] Antony Browne et al. (1997). Neural Network Perspectives on Cognition and Adaptive Robotics.
[12] Christoph Scheepers et al. (2003). Integration of Syntactic and Semantic Information in Predictive Processing: Cross-Linguistic Evidence from German and English. Journal of Psycholinguistic Research.
[13] Risto Miikkulainen et al. (1997). Natural Language Processing with Subsymbolic Neural Networks. In Neural Network Perspectives on Cognition and Adaptive Robotics.
[14] M. Hartley et al. (2005). Attention as sigma-pi controlled ACh-based feedback. Proceedings of the 2005 IEEE International Joint Conference on Neural Networks.
[15] Marshall R. Mayberry et al. (2005). A Connectionist Model of Sentence Comprehension in Visual Worlds.