Multimodal Communication in Computer-Mediated Map Task Scenarios

Max M. Louwerse (mlouwerse@memphis.edu) a
Patrick Jeuniaux (pjeuniau@memphis.edu) a
Mohammed E. Hoque (mhoque@memphis.edu) c
Jie Wu (wjie@mail.psyc.memphis.edu) b
Gwyneth Lewis (glewis@mail.psyc.memphis.edu) a

Department of Psychology / Institute for Intelligent Systems a
Department of Computer Science / Institute for Intelligent Systems b
Department of Electrical and Computer Engineering / Institute for Intelligent Systems c
Memphis, TN 38152 USA

Abstract

Multimodal communication involves the co-occurrence of different communicative channels, including speech, eye gaze and facial expressions. The questions addressed in this study are how these modalities correlate and how they are aligned with the discourse structure. The study focuses on a map task scenario in which participants coordinate a route on a map while their speech, eye gaze, face and torso are recorded. Results show that eye gaze, facial expression and pauses correlate at certain points in the discourse and that these points can be identified by the speaker’s intentions behind the dialog moves. This study thereby sheds light on multimodal communication in humans and offers guidelines for its implementation in animated conversational agents.

Introduction

Most communicative processes require multiple channels, both linguistic and paralinguistic (Clark, 1996). For instance, we gesture while talking on the phone, we seek eye contact when we want to speak, we maintain eye contact to ensure that the dialog participant comprehends us, and we express our emotional and cognitive states through facial expressions. These different communicative channels play an important role in the interpretation of an utterance by the dialog partner. For instance, the interpretation of “Are you hungry?” depends on the context (e.g.
just before going to a restaurant, during dinner), on eye gaze (staring somebody in the eyes or looking away), on prosody (e.g. stress on ‘you’ or ‘hungry’), on facial expressions (e.g. surprised look, disgusted look) and on gestures (e.g. rubbing stomach, pointing at a restaurant). While multimodal communication is easy for dialog participants to comprehend, it is hard for researchers to monitor and analyze. Although we know that linguistic modalities (e.g. dialog move, intonation, pause) and paralinguistic modalities (e.g. facial expressions, eye gaze, gestures) co-occur in communication, the exact nature of their interaction remains unclear (Louwerse, Bard, Steedman, Graesser & Hu, 2004).

There are two primary reasons why insight into the interaction of modalities in the communicative process is beneficial. First, from a psychological point of view, it helps us understand how communicative processes take shape in the minds of dialog participants. Under what psychological conditions are different channels aligned? Does a channel add information to the communicative process or does it merely co-occur with other channels? Research in psychology has shed light on the interaction of modalities, for instance comparing eye gaze (Argyle & Cook, 1976; Doherty-Sneddon et al., 1997), gestures (Goldin-Meadow, 2003; Louwerse & Bangerter, 2005; McNeill, 1992) and facial expressions (Ekman, 1979), but many questions regarding multiple channels (i.e., more than pairs of channels) and their alignment remain unanswered.

Second, insight into multimodal communication is beneficial from a computational point of view, for instance in the development of animated conversational agents (Louwerse, Graesser, Lu, & Mitchell, 2004). The naturalness of human-computer interaction can be maximized by the use of animated conversational agents, because of the availability of both linguistic (semantics, syntax) and paralinguistic (pragmatic, sociological) features.
These animated agents have anthropomorphic, automated, talking heads with facial features and gestures that are coordinated with text-to-speech engines (Cassell & Thorisson, 1999; Massaro & Cohen, 1994; Picard, 1997). Examples of these agents are Baldi (Massaro & Cohen, 1994), COSMO (Lester, Stone, & Stelling, 1999), STEVE (Rickel & Johnson, 1999), Herman the Bug (Lester, Stone, & Stelling, 1999) and AutoTutor (Graesser, Person, et al., 2001). Though the naturalness of these agents is steadily improving, there is room for further improvement. Current agents, for instance, stare incessantly at the dialog partner, use a limited set of facial features rather randomly, or produce bursts of unpaused speech.

Both psycholinguistics and computational linguistics would thus benefit from answers to questions regarding the interaction of multimodal channels. A specific and related question concerns the mapping of these channels onto the discourse structure. Research has shown that the structure of the dialog can often predict these modalities. For instance, Taylor, King, Isard, and Wright (1998) and Hastie-Wright, Poesio, and Isard (2002) have

[1] M. Cranach, et al. Human Ethology: Claims and Limits of a New Discipline. 1982.

[2] Anne H. Anderson, et al. The HCRC Map Task Corpus. 1991.

[3] M. Argyle, et al. Gaze and Mutual Gaze. British Journal of Psychiatry, 1994.

[4] M. Studdert-Kennedy. Hand and Mind: What Gestures Reveal About Thought. 1994.

[5] D. W. Massaro, et al. Visual, orthographic, phonological, and lexical influences in reading. Journal of Experimental Psychology: Human Perception and Performance, 1994.

[6] Gwyneth Doherty-Sneddon, et al. The Reliability of a Dialogue Structure Coding Scheme. CL, 1997.

[7] Gwyneth Doherty-Sneddon, et al. Face-to-face and video mediated communication: a comparison of dialogue structure and task performance. 1997.

[8] P. Taylor, et al. Intonation and dialogue context as constraints for speech recognition. 1998.

[9] W. Lewis Johnson, et al. Animated Agents for Procedural Training in Virtual Reality: Perception, Cognition, and Motor Control. Appl. Artif. Intell., 1999.

[10] H. Branigan, et al. Non-linguistic influences on rates of disfluency in spontaneous speech. 1999.

[11] Kristinn R. Thórisson, et al. The Power of a Nod and a Glance: Envelope Vs. Emotional Feedback in Animated Conversational Agents. Appl. Artif. Intell., 1999.

[12] Rosalind W. Picard, et al. An affective model of interplay between emotions and learning: reengineering educational pedagogy-building a learning companion. Proceedings IEEE International Conference on Advanced Learning Technologies, 2001.

[13] E. Vesterinen, et al. Affective Computing. Encyclopedia of Biometrics, 2009.

[14] Arthur C. Graesser, et al. Teaching Tactics and Dialog in AutoTutor. 2001.

[15] Helen F. Hastie, et al. Automatically predicting dialogue structure using prosodic features. Speech Commun., 2002.

[16] Ka Cormier. Gesture: The Living Medium. 2002.

[17] Paul Boersma, et al. Praat, a system for doing phonetics by computer. 2002.

[18] Maria L. Flecha-Garcia. Eyebrow raising and communication in Map Task dialogues. 2003.

[19] S. Goldin-Meadow, et al. Hearing Gesture: How Our Hands Help Us Think. 2003.

[20] Heather H. Mitchell, et al. Toward a Taxonomy of a Set of Discourse Markers in Dialog: A Theoretical and Computational Linguistic Account. 2003.

[21] Susan Goldin. Hearing gesture: how our hands help us think. 2003.

[22] James C. Lester, et al. Lifelike Pedagogical Agents for Mixed-initiative Problem Solving in Constructivist Learning Environments. User Modeling and User-Adapted Interaction, 2004.

[23] Scotty D. Craig, et al. Affect and learning: An exploratory look into the role of affect in learning with AutoTutor. 2004.

[24] Max M. Louwerse, et al. Focusing Attention with Deictic Gestures and Linguistic Expressions. 2005.

[25] Heather H. Mitchell, et al. Social Cues in Animated Conversational Agents. 2005.

[26] P. Boersma. Praat: doing phonetics by computer (version 4.4.24). 2006.

[27] Max M. Louwerse, et al. Dialog Act Classification Using N-Gram Algorithms. FLAIRS, 2006.