A framework for evaluating multimodal integration by humans and a role for embodied conversational agents

One of the implicit assumptions of multimodal interfaces is that human-computer interaction is significantly facilitated by providing multiple input and output modalities. Surprisingly, however, there is very little theoretical and empirical research testing this assumption with respect to the presentation of multimodal displays to the user. The goal of this paper is to provide both a theoretical and an empirical framework for addressing this important issue. Two contrasting classes of models of human information processing are formulated and tested experimentally. According to integration models, multiple sensory influences are continuously combined during categorization, leading to perceptual experience and action. The Fuzzy Logical Model of Perception (FLMP) assumes that processing occurs in three successive but overlapping stages: evaluation, integration, and decision (Massaro, 1998). According to nonintegration models, any perceptual experience and action results from only a single sensory influence. These models are tested in expanded factorial designs, in which two input modalities are varied independently of one another in a factorial design and each modality is also presented alone. Results from a variety of experiments on speech, emotion, and gesture support the predictions of the FLMP. Baldi, an embodied conversational agent, is described, and implications for applications of multimodal interfaces are discussed.
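The FLMP's evaluation-integration-decision sequence for a two-alternative choice can be sketched as follows. This is a minimal illustration, not the paper's implementation: each modality is assumed to have already been evaluated into a fuzzy truth value in [0, 1] expressing its support for alternative A; integration multiplies the supports for each alternative, and decision normalizes them (the relative goodness rule). The function name `flmp` and the example support values are illustrative assumptions.

```python
def flmp(*supports):
    """FLMP decision rule for a two-alternative choice.

    Each element of `supports` is a fuzzy truth value in [0, 1] giving
    the degree to which one modality (e.g. auditory, visual) supports
    alternative A. Integration multiplies the supports for A and for
    the competing alternative; decision normalizes the two products.
    """
    support_a = 1.0
    support_b = 1.0
    for s in supports:
        support_a *= s        # multiplicative evidence for alternative A
        support_b *= 1.0 - s  # multiplicative evidence for the competitor
    return support_a / (support_a + support_b)


# Unimodal presentation, as in the expanded factorial design,
# reproduces the single source's support value:
print(flmp(0.8))        # -> 0.8

# Two consistent sources yield stronger support than either alone,
# the integration signature the FLMP predicts:
print(flmp(0.8, 0.6))   # -> 0.857...
```

Note how the unimodal conditions of the expanded factorial design fall out of the same rule as the bimodal conditions: with a single source, normalization returns that source's support unchanged, which is what lets the design discriminate integration from nonintegration models.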

[1] M. Mesulam, et al. From sensation to cognition, 1998, Brain: a journal of neurology.

[2] Dominic W. Massaro, et al. Training a talking head, 2002, Proceedings, Fourth IEEE International Conference on Multimodal Interfaces.

[3] Sharon L. Oviatt, et al. Toward a theory of organized multimodal integration patterns during human-computer interaction, 2003, ICMI '03.

[4] J. Townsend, et al. Computational, geometric, and process perspectives on facial cognition: contexts and challenges, 2005.

[5] Michael D. Harrison, et al. Formal methods in human-computer interaction, 1990.

[6] Jonas Beskow, et al. From theory to practice: rewards and challenges, 1999.

[7] Dominic W. Massaro, et al. The processing of information from multiple sources in simultaneous interpreting, 2000.

[8] H. Pashler The Psychology of Attention, 1997.

[9] Dominic W. Massaro, et al. A computer-animated tutor for spoken and written language learning, 2003, ICMI '03.

[10] Sven Wachsmuth, et al. Evaluating integrated speech- and image understanding, 2002, Proceedings, Fourth IEEE International Conference on Multimodal Interfaces.

[11] Dominic W. Massaro, et al. Speech recognition and sensory integration, 1998.

[12] Dominic W. Massaro, et al. Animated speech: research progress and applications, 2001, AVSP.

[13] Shimei Pan, et al. Context-based multimodal input understanding in conversational systems, 2002, Proceedings, Fourth IEEE International Conference on Multimodal Interfaces.

[14] Dominic W. Massaro, et al. Perceiving asynchronous bimodal speech in consonant-vowel and vowel syllables, 1993, Speech Commun.

[15] D. Massaro Speech Perception by Ear and Eye: A Paradigm for Psychological Inquiry, 1989.

[16] D. Massaro Multimodal Speech Perception: A Paradigm for Speech Science, 2002.

[17] Slim Ouni, et al. Internationalization of a Talking Head, 2003.

[18] Béatrice de Gelder, et al. Perceiving emotions by ear and by eye, 2003, INTERSPEECH.

[19] Jonas Beskow, et al. Recent developments in facial animation: an inside view, 1998, AVSP.

[20] James L. McClelland, et al. The Morton-Massaro law of information integration: implications for models of perception, 2001, Psychological Review.

[21] Sandra L. Calvert, et al. Brief report: vocabulary acquisition for children with autism: teacher or computer instruction, 2000, Journal of Autism and Developmental Disorders.

[22] Philip R. Cohen, et al. A map-based system using speech and 3D gestures for pervasive computing, 2002, Proceedings, Fourth IEEE International Conference on Multimodal Interfaces.

[23] D. Massaro, et al. Models of integration given multiple sources of information, 1990, Psychological Review.

[24] L. A. Thompson, et al. Children's integration of speech and pointing gestures in comprehension, 1994, Journal of Experimental Child Psychology.

[25] G. Plant Perceiving Talking Faces: From Speech Perception to a Behavioral Principle, 1999.

[26] D. Massaro, et al. Development and evaluation of a computer-animated tutor for vocabulary and language learning in children with autism, 2003, Journal of Autism and Developmental Disorders.

[27] D. Massaro, et al. Fuzzy logical model of bimodal emotion perception: comment on "The perception of emotions by ear and by eye" by de Gelder and Vroomen, 2000.

[28] D. Massaro Testing between the TRACE model and the fuzzy logical model of speech perception, 1989, Cognitive Psychology.

[29] D. W. Massaro, et al. Children's perception of visual and auditory speech, 1984, Child Development.

[30] D. W. Massaro, et al. Speech perception in perceivers with hearing loss: synergy of multiple modalities, 1999, Journal of Speech, Language, and Hearing Research (JSLHR).

[31] Lotfi A. Zadeh, et al. Fuzzy Sets, 1996, Inf. Control.

[32] Dominic W. Massaro, et al. Read my tongue movements: bimodal learning to perceive and produce non-native speech /r/ and /l/, 2003, INTERSPEECH.

[33] B. Stein, et al. The Merging of the Senses, 1993.

[34] D. Massaro Ambiguity in perception and experimentation, 1988, Journal of Experimental Psychology: General.

[35] D. McNeill So you think gestures are nonverbal, 1985.

[36] N. P. Erber, et al. Auditory, visual, and auditory-visual recognition of consonants by children with normal and impaired hearing, 1972, Journal of Speech and Hearing Research.

[37] E. Horvitz, et al. Models of attention in computing and communication: from principles to applications, 2003.