User-centered modeling and evaluation of multimodal interfaces

Historically, the development of computer interfaces has been a technology-driven phenomenon. However, new multimodal interfaces are composed of recognition-based technologies that must interpret human speech, gesture, gaze, movement patterns, and other complex natural behaviors, which involve highly automatized skills that are not under full conscious control. As a result, it is now widely acknowledged that multimodal interface design requires modeling of the modality-centered behavior and integration patterns upon which multimodal systems aim to build. This paper summarizes research on the cognitive science foundations of multimodal interaction, and on the essential role that user-centered modeling has played in prototyping, guiding, and evaluating the design of next-generation multimodal interfaces. In particular, it discusses the properties of different modalities and the information content they carry, the unique features of multimodal language and its processability, as well as when users are likely to interact multimodally and how their multimodal input is integrated and synchronized. It also reviews research on typical performance and linguistic efficiencies associated with multimodal interaction, and on the user-centered reasons why multimodal interaction minimizes errors and expedites error handling. In addition, this paper describes the important role that selective methodologies and evaluation metrics have played in shaping next-generation multimodal systems, and it concludes by highlighting future directions for designing a new class of adaptive multimodal-multisensor interfaces.
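
To make the integration and error-handling ideas in the abstract concrete, the sketch below illustrates late (semantic-level) fusion of n-best hypotheses from a speech recognizer and a gesture recognizer, including the mutual-disambiguation effect in which a lower-ranked hypothesis on one list can rescue the correct interpretation on the other. This is a minimal illustration under assumed names and thresholds (the `Hypothesis` class, the compatibility table, and the 4-second lag window are all hypothetical), not the architecture described in the paper; deployed systems use richer unification-based or statistical frameworks.

```python
# Minimal sketch of semantic-level multimodal fusion with mutual disambiguation.
# All class names, commands, and thresholds are hypothetical illustrations.
from dataclasses import dataclass
from itertools import product
from typing import Optional

@dataclass
class Hypothesis:
    content: str        # recognized symbol, e.g. a spoken command or gestured shape
    score: float        # recognizer confidence in [0, 1]
    start: float        # onset time in seconds
    end: float          # offset time in seconds

def temporally_compatible(a: Hypothesis, b: Hypothesis, max_lag: float = 4.0) -> bool:
    """Accept pairs that overlap in time or fall within a short lag window,
    reflecting the finding that pen or gesture input often precedes speech."""
    return (min(a.end, b.end) - max(a.start, b.start)) > -max_lag

def semantically_compatible(speech: Hypothesis, gesture: Hypothesis) -> bool:
    """Toy compatibility table: a spoken command must accept the gestured shape."""
    table = {"create barrier": {"line"}, "place unit": {"point"}}
    return gesture.content in table.get(speech.content, set())

def fuse(speech_nbest: list[Hypothesis],
         gesture_nbest: list[Hypothesis]) -> Optional[tuple[Hypothesis, Hypothesis]]:
    """Return the best jointly consistent (speech, gesture) pair.
    A lower-ranked hypothesis on one list can 'pull up' the correct hypothesis
    on the other list -- the mutual disambiguation of recognition errors."""
    best, best_score = None, float("-inf")
    for s, g in product(speech_nbest, gesture_nbest):
        if temporally_compatible(s, g) and semantically_compatible(s, g):
            joint = s.score * g.score          # naive joint score
            if joint > best_score:
                best, best_score = (s, g), joint
    return best

if __name__ == "__main__":
    speech = [Hypothesis("place unit", 0.55, 1.2, 2.0),
              Hypothesis("create barrier", 0.45, 1.2, 2.0)]
    gesture = [Hypothesis("line", 0.70, 0.3, 1.0),
               Hypothesis("point", 0.30, 0.3, 1.0)]
    # The second-ranked speech hypothesis wins because it pairs with the
    # top-ranked gesture hypothesis, correcting the speech recognizer's error.
    print(fuse(speech, gesture))
```

In this toy example the speech recognizer's top choice ("place unit") is wrong, but because it only pairs with the weak "point" gesture, the fused interpretation falls back to "create barrier" plus "line", showing in miniature why multimodal architectures can minimize errors relative to either recognizer alone.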
