Multimodal Communication in Computer-Mediated Map Task Scenarios

Max M. Louwerse (mlouwerse@memphis.edu) a
Patrick Jeuniaux (pjeuniau@memphis.edu) a
Mohammed E. Hoque (mhoque@memphis.edu) c
Jie Wu (wjie@mail.psyc.memphis.edu) b
Gwyneth Lewis (glewis@mail.psyc.memphis.edu) a

a Department of Psychology / Institute for Intelligent Systems
b Department of Computer Science / Institute for Intelligent Systems
c Department of Electrical and Computer Engineering / Institute for Intelligent Systems
Memphis, TN 38152 USA

Abstract

Multimodal communication involves the co-occurrence of different communicative channels, including speech, eye gaze, and facial expressions. The questions addressed in this study are how these modalities correlate and how they align with the discourse structure. The study focuses on a map task scenario in which participants coordinate a route on a map while their speech, eye gaze, face, and torso are recorded. Results show that eye gaze, facial expression, and pauses correlate at certain points in the discourse, and that these points can be identified by the speaker's intentions behind the dialog moves. The study thereby sheds light on multimodal communication in humans and provides guidelines for implementation in animated conversational agents.

Introduction

Most communicative processes require multiple channels, both linguistic and paralinguistic (Clark, 1996). For instance, we talk on the phone while gesturing, we seek eye contact when we want to speak, we maintain eye contact to ensure that the dialog partner comprehends us, and we express our emotional and cognitive states through facial expressions. These different communicative channels play an important role in the dialog partner's interpretation of an utterance. For instance, the interpretation of "Are you hungry?" depends on the context (e.g., just before going to a restaurant, during dinner), on eye gaze (staring somebody in the eyes or looking away), on prosody (e.g., stress on 'you' or 'hungry'), on facial expressions (e.g., a surprised or disgusted look), and on gestures (e.g., rubbing one's stomach, pointing at a restaurant).

While multimodal communication is easy for dialog participants to comprehend, it is hard for researchers to monitor and analyze. Although we know that linguistic modalities (e.g., dialog move, intonation, pause) and paralinguistic modalities (e.g., facial expressions, eye gaze, gestures) co-occur in communication, the exact nature of their interaction remains unclear (Louwerse, Bard, Steedman, Graesser & Hu, 2004).

There are two primary reasons why insight into the interaction of modalities in the communicative process is beneficial. First, from a psychological point of view, it helps us understand how communicative processes take shape in the minds of dialog participants. Under what psychological conditions are different channels aligned? Does a channel add information to the communicative process, or does it merely co-occur with other channels? Research in psychology has shed light on the interaction of modalities, for instance by comparing eye gaze (Argyle & Cook, 1976; Doherty-Sneddon et al., 1997), gestures (Goldin-Meadow, 2003; Louwerse & Bangerter, 2005; McNeill, 1992), and facial expressions (Ekman, 1979), but many questions regarding multiple channels (i.e., more than pairs of channels) and their alignment remain unanswered.
Second, insight into multimodal communication is beneficial from a computational point of view, for instance in the development of animated conversational agents (Louwerse, Graesser, Lu, & Mitchell, 2004). Animated conversational agents can maximize the naturalness of human-computer interaction because they make both linguistic (semantic, syntactic) and paralinguistic (pragmatic, sociological) features available. These animated agents have anthropomorphic, automated, talking heads with facial features and gestures that are coordinated with text-to-speech engines (Cassell & Thorisson, 1999; Massaro & Cohen, 1994; Picard, 1997). Examples of such agents are Baldi (Massaro & Cohen, 1994), COSMO (Lester, Stone & Stelling, 1999), STEVE (Rickel & Johnson, 1999), Herman the Bug (Lester, Stone & Stelling, 1999), and AutoTutor (Graesser, Person, et al., 2001). Though the naturalness of these agents is steadily improving, there is room for improvement: current agents, for instance, stare incessantly at the dialog partner, use a limited set of facial features rather randomly, or produce bursts of unpaused speech.

Both psycholinguistics and computational linguistics would thus benefit from answers to questions regarding the interaction of multimodal channels. A specific and related question concerns the mapping of these channels onto the discourse structure. Research has shown that the structure of the dialog can often predict these modalities. For instance, Taylor, King, Isard, & Wright (1998) and Hastie-Wright, Poesio, and Isard (2002) have shown that dialog structure can be predicted from intonation and other prosodic features.
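The kind of alignment analysis described above can be made concrete with a small sketch. The code below is not the authors' analysis pipeline; it is a minimal illustration under assumed, hypothetical input formats: each dialog move is a (start, end, label) tuple, and each modality (e.g., gaze aversion, eyebrow raising, silent pauses) is a list of (start, end) event intervals in seconds. The sketch discretizes the annotations into fixed-length frames and computes, for each dialog-move type, the phi (Pearson) correlation between every pair of modality streams.

```python
from collections import defaultdict
import math

FRAME = 0.1  # seconds per analysis frame (an arbitrary choice for this sketch)

def to_frames(intervals, duration):
    """Convert (start, end) event intervals in seconds to a binary per-frame vector."""
    n = int(duration / FRAME)
    frames = [0] * n
    for start, end in intervals:
        for i in range(int(start / FRAME), min(n, int(end / FRAME) + 1)):
            frames[i] = 1
    return frames

def phi(x, y):
    """Pearson correlation of two equal-length binary vectors (the phi coefficient).
    Returns 0.0 when a stream is constant in the segment (correlation undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def correlate_by_move(moves, modalities, duration):
    """For each dialog-move label, correlate every pair of modality streams
    within the frames covered by moves of that label.

    moves      -- list of (start, end, label) tuples, e.g. ("instruct", "acknowledge")
    modalities -- dict mapping a modality name (e.g. "gaze_away", "eyebrow_raise",
                  "pause") to a list of (start, end) event intervals
    duration   -- total duration of the dialog in seconds
    """
    n_total = int(duration / FRAME)
    streams = {name: to_frames(ivs, duration) for name, ivs in modalities.items()}
    results = defaultdict(dict)
    for label in {lab for _, _, lab in moves}:
        # Indices of all frames that fall inside moves carrying this label.
        idx = [i for start, end, lab in moves if lab == label
               for i in range(int(start / FRAME), min(n_total, int(end / FRAME)))]
        if not idx:
            continue
        for a in streams:
            for b in streams:
                if a < b:  # each unordered pair of modalities once
                    results[label][(a, b)] = phi([streams[a][i] for i in idx],
                                                 [streams[b][i] for i in idx])
    return results

if __name__ == "__main__":
    # Toy example with made-up annotations and HCRC-style move labels.
    moves = [(0.0, 2.5, "instruct"), (2.5, 3.2, "acknowledge"), (3.2, 6.0, "instruct")]
    modalities = {"gaze_away": [(0.4, 1.1), (3.5, 4.0)],
                  "pause": [(1.0, 1.3), (4.8, 5.1)]}
    print(correlate_by_move(moves, modalities, duration=6.0))
```

In an actual study the event intervals would come from coded video, eye tracking, and a tool such as Praat for pause detection, and the per-move correlations would then be tested statistically rather than merely computed.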
References

[1] M. Cranach et al. Human Ethology: Claims and Limits of a New Discipline. 1982.
[2] Anne H. Anderson et al. The HCRC Map Task Corpus. 1991.
[3] M. Argyle et al. Gaze and Mutual Gaze. 1994. British Journal of Psychiatry.
[4] M. Studdert-Kennedy. Hand and Mind: What Gestures Reveal About Thought. 1994.
[5] D. W. Massaro et al. Visual, orthographic, phonological, and lexical influences in reading. 1994. Journal of Experimental Psychology: Human Perception and Performance.
[6] Gwyneth Doherty-Sneddon et al. The Reliability of a Dialogue Structure Coding Scheme. 1997. Computational Linguistics.
[7] Gwyneth Doherty-Sneddon et al. Face-to-face and video-mediated communication: a comparison of dialogue structure and task performance. 1997.
[8] P. Taylor et al. Intonation and dialogue context as constraints for speech recognition. 1998.
[9] W. Lewis Johnson et al. Animated Agents for Procedural Training in Virtual Reality: Perception, Cognition, and Motor Control. 1999. Applied Artificial Intelligence.
[10] H. Branigan et al. Non-linguistic influences on rates of disfluency in spontaneous speech. 1999.
[11] Kristinn R. Thórisson et al. The Power of a Nod and a Glance: Envelope vs. Emotional Feedback in Animated Conversational Agents. 1999. Applied Artificial Intelligence.
[12] Rosalind W. Picard et al. An affective model of interplay between emotions and learning: reengineering educational pedagogy-building a learning companion. 2001. Proceedings IEEE International Conference on Advanced Learning Technologies.
[13] E. Vesterinen et al. Affective Computing. 2009. Encyclopedia of Biometrics.
[14] Arthur C. Graesser et al. Teaching Tactics and Dialog in AutoTutor. 2001.
[15] Helen F. Hastie et al. Automatically predicting dialogue structure using prosodic features. 2002. Speech Communication.
[16] K. Cormier. Gesture: The Living Medium. 2002.
[17] Paul Boersma et al. Praat, a system for doing phonetics by computer. 2002.
[18] Maria L. Flecha-Garcia. Eyebrow raising and communication in Map Task dialogues. 2003.
[19] S. Goldin-Meadow et al. Hearing Gesture: How Our Hands Help Us Think. 2003.
[20] Heather H. Mitchell et al. Toward a Taxonomy of a Set of Discourse Markers in Dialog: A Theoretical and Computational Linguistic Account. 2003.
[21] Susan Goldin. Hearing Gesture: How Our Hands Help Us Think. 2003.
[22] James C. Lester et al. Lifelike Pedagogical Agents for Mixed-Initiative Problem Solving in Constructivist Learning Environments. 2004. User Modeling and User-Adapted Interaction.
[23] Scotty D. Craig et al. Affect and learning: An exploratory look into the role of affect in learning with AutoTutor. 2004.
[24] Max M. Louwerse et al. Focusing Attention with Deictic Gestures and Linguistic Expressions. 2005.
[25] Heather H. Mitchell et al. Social Cues in Animated Conversational Agents. 2005.
[26] P. Boersma. Praat: doing phonetics by computer (version 4.4.24). 2006.
[27] Max M. Louwerse et al. Dialog Act Classification Using N-Gram Algorithms. 2006. FLAIRS.