Incremental Modeling of Language Understanding Using Speech Act Frames

Wende Frost (wende.frost@asu.edu)
Arizona State University, School of Computing and Informatics, 699 S. Mill Avenue, Tempe, AZ 85281 USA

Magdalena Bugajska (magda.bugajska@nrl.navy.mil)
J. Gregory Trafton (greg.trafton@nrl.navy.mil)
Navy Center for Applied Research in Artificial Intelligence, Naval Research Laboratory, 4555 Overlook Avenue SW, Washington, DC 20375 USA

Abstract

Using the cognitive architecture ACT-R/E, we designed a framework for implementing cognitively plausible spoken language understanding on an embodied agent, using incremental frame representations for multiple levels of linguistic knowledge. Emphasis is placed on semantics, pragmatics, and speaker intent.

Keywords: cognitive modeling; language processing; pragmatics; semantics

Introduction

One of the greatest challenges in building and interacting with embodied agents is integrating cognitive plausibility without sacrificing usability. This challenge is especially evident in natural language understanding for spoken language environments. In real-life scenarios, agents need to respond to commands, gather information, and answer queries quickly, even when faced with unexpected input from human users. Unexpected input can occur at many different linguistic levels: the agent's speech recognition software may fail to recognize a word, an irregular syntactic utterance may need to be processed and understood, or a verbal interruption may arrive in the middle of a task. While failure to cope promptly with all kinds of unexpected input makes an agent less useful in the field, it also exposes a lack of cognitive plausibility in the framework of the model. Humans do not reach long-lasting impasses when faced with any of these relatively simple situations (Gibson, 1991).
In addition to performing such tasks, the agent should also be able to hear, process, and remember utterances the speaker directed at other agents in the environment without mistaking them for commands or queries the speaker intended the agent itself to carry out. A cognitively plausible agent would then be able to use these utterances directed toward others, especially relatively recent ones, and incorporate them with world knowledge for use in future goals.

We have implemented a framework within the ACT-R/E 6 cognitive architecture (Anderson et al., 2004; Anderson & Lebiere, 1998) that aims to fulfill these requirements at both functional and plausible levels. The framework's focus is to obtain a correct interpretation of the speaker's intentions (i.e., what the speaker wants) based upon the current state of the world and the past world knowledge a model has in memory, rather than a syntactically exact parse of the utterance (i.e., what the speaker said) divorced from an outside environment. Processing in our framework is done at all levels for each word as it comes in. The framework retrieves, creates, and edits frames of knowledge from the phonological to the pragmatic at each step, enabling an agent to maintain a constantly developing picture of the utterance.

Our framework differs from other natural language work in the ACT-R family of architectures by focusing on processing spoken natural language in real time and by emphasizing pragmatic and semantic roles in the utterance. It was also created to be easily expandable in other useful embodied directions, including processing gestural information as part of an utterance structure. In keeping with the fundamental notion that language processing is another aspect of human cognition, subject to the same representations and processes as other cognitive activities, our framework does not implement a dedicated "language module" (Croft, 2004). Instead, language processing is done across existing modules.
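The incremental, frame-based processing described above can be sketched roughly as follows. This is an illustrative sketch only, not the paper's implementation: all names (Frame, IncrementalUnderstander, the word lists used as toy heuristics) are hypothetical stand-ins for the frame retrieval, creation, and editing the framework performs within ACT-R/E.

```python
# Hypothetical sketch: incremental, word-by-word updating of frames
# at several linguistic levels, so the interpretation is always
# complete "so far" rather than deferred until the utterance ends.

LEVELS = ("phonological", "lexical", "syntactic", "semantic", "pragmatic")

class Frame:
    """A mutable bundle of slot-value pairs for one linguistic level."""
    def __init__(self, level):
        self.level = level
        self.slots = {}

    def update(self, slot, value):
        self.slots[slot] = value

class IncrementalUnderstander:
    """Maintains one frame per level and revises them on each word."""
    def __init__(self):
        self.frames = {level: Frame(level) for level in LEVELS}
        self.words = []

    def hear(self, word):
        # Each incoming word is processed at every level before the
        # next word arrives.
        self.words.append(word)
        self.frames["lexical"].update(len(self.words), word)
        # Toy heuristics standing in for real retrieval and editing:
        if word in ("put", "take", "bring"):
            self.frames["semantic"].update("action", word)
            self.frames["pragmatic"].update("speech_act", "command")
        elif word in ("what", "where", "who"):
            self.frames["pragmatic"].update("speech_act", "query")

    def current_interpretation(self):
        return {level: dict(f.slots) for level, f in self.frames.items()}

agent = IncrementalUnderstander()
for w in "put the box there".split():
    agent.hear(w)
print(agent.current_interpretation()["pragmatic"])  # speech act so far
```

The point of the sketch is the control flow: every level's frame is available and revisable after each word, which is what lets an agent act on an interruption or a partial command without waiting for a full parse.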
This non-dedicated-module approach differentiates our framework from much other ACT-R work on language, such as Ball (2007) and Emond (2006). As with other work in ACT-R that does not have a dedicated language module (Lewis & Vasishth, 2005; Lewis, Vasishth, & Van Dyke, 2006), we have included an additional buffer to store language information. Unlike this work, however, we do not have any parallel lexical access mechanisms. Our divergence from a modular approach is also similar to the NL-SOAR language comprehension theory implemented in the SOAR cognitive architecture, but NL-SOAR focuses on explaining a large number of sentence-level syntactic phenomena (Lewis, 1993), whereas we place more emphasis on semantics and pragmatics.

To achieve a cognitively plausible framework for modeling natural language understanding, we used the ACT-R/E cognitive architecture with the default ACT-R parameters set. ACT-R 6 is a production system architecture composed of two kinds of knowledge: declarative and procedural. Declarative knowledge, also known as factual knowledge, is stored in long-term declarative memory as "chunks." These chunks, as well as chunks based upon perceptual
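The chunk representation mentioned above can be sketched minimally as typed slot-value structures retrieved from declarative memory by matching. This is a hypothetical illustration of the general ACT-R chunk idea, not the ACT-R 6 implementation; the chunk contents and the `retrieve` helper are invented for the example.

```python
# Minimal sketch of ACT-R-style declarative chunks: each chunk has a
# type ("isa") and named slots, and is retrieved from long-term
# declarative memory by matching slot constraints.
# Hypothetical illustration only.

def make_chunk(chunk_type, **slots):
    return {"isa": chunk_type, **slots}

declarative_memory = [
    make_chunk("word", form="box", category="noun", meaning="container"),
    make_chunk("word", form="put", category="verb", meaning="place-action"),
]

def retrieve(memory, **pattern):
    """Return the first chunk whose slots match every constraint."""
    for chunk in memory:
        if all(chunk.get(slot) == value for slot, value in pattern.items()):
            return chunk
    return None  # retrieval failure

hit = retrieve(declarative_memory, isa="word", form="put")
```

In the real architecture, retrieval is activation-based rather than first-match, but the slot-value chunk structure is the substrate the framework's linguistic frames are built on.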

References

[1] J. Sadock. Speech acts. 2007.

[2] P. Hagoort. The fractionation of spoken language understanding by measuring electrical and magnetic brain signals. Philosophical Transactions of the Royal Society B: Biological Sciences, 2008.

[3] B. Fraser. An approach to discourse markers. 1990.

[4] J. R. Anderson, et al. An integrated theory of the mind. Psychological Review, 2004.

[5] R. L. Lewis, et al. An Activation-Based Model of Sentence Processing as Skilled Memory Retrieval. Cognitive Science, 2005.

[6] J. L. Austin. How to Do Things With Words. 2009.

[7] R. L. Lewis, et al. Computational principles of working memory in sentence comprehension. Trends in Cognitive Sciences, 2006.

[8] S. Sekine, et al. Using Phrasal Patterns to Identify Discourse Relations. HLT-NAACL, 2006.

[9] P. Langley, et al. A Unified Cognitive Architecture for Physical Agents. AAAI, 2006.

[10] C. Lebiere, et al. The Atomic Components of Thought. 1998.

[11] J. R. Lewis, et al. Effect of Error Correction Strategy on Speech Dictation Throughput. 1999.

[12] A. Newell, et al. The Psychology of Human-Computer Interaction. 1983.

[13] G. T. M. Altmann. Thematic role assignment in context. 1999.

[14] A. Nenkova, et al. Easily Identifiable Discourse Relations. COLING, 2008.

[15] J. R. Williams, et al. Guidelines for the Use of Multimedia in Instruction. 1998.

[16] J. T. Ball, et al. Construction Driven Language Processing. 2005.

[17] E. Vesterinen, et al. Affective Computing. Encyclopedia of Biometrics, 2009.

[18] A. Goldberg. Constructions: A Construction Grammar Approach to Argument Structure. 1995.

[19] E. Gibson. A computational theory of human linguistic processing: memory limitations and processing breakdown. 1991.