A Study in Efficiency and Modality Usage in Multimodal Form Filling Systems

The usage patterns of speech and visual input modes are investigated as a function of relative input mode efficiency for both desktop and personal digital assistant (PDA) working environments. For this purpose, the form-filling part of a multimodal dialogue system is implemented and evaluated; three multimodal modes of interaction are implemented: “Click-to-Talk,” “Open-Mike,” and “Modality-Selection.” “Modality-Selection” implements an adaptive interface in which the system selects the most efficient input mode at each turn, effectively alternating between a “Click-to-Talk” and an “Open-Mike” interaction style, as proposed in “Modality tracking in the multimodal Bell Labs Communicator,” Proceedings of the Automatic Speech Recognition and Understanding Workshop, by A. Potamianos, 2003 [17]. The multimodal systems are evaluated and compared with the unimodal systems. Objective and subjective measures include task completion, task duration, turn duration, and overall user satisfaction. Turn duration is broken down into interaction time and inactivity time to better measure the efficiency of each input mode. Duration statistics and empirical probability density functions are computed as a function of interaction context and user. Results show that the multimodal systems outperform the unimodal systems in terms of both objective and subjective criteria. Users also tend to use the most efficient input mode at each turn; however, a bias towards the default input modality and a general bias towards the speech modality also exist. Results demonstrate that although users exploit some of the available synergies in multimodal dialogue interaction, further efficiency gains can be achieved by designing adaptive interfaces that fully exploit these synergies.
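
To illustrate the kind of per-turn adaptation that “Modality-Selection” performs, the following is a minimal sketch (in Python) of a selection policy that opens each turn in whichever input mode has the lower observed average turn duration, i.e., interaction time plus inactivity time, for the current form field. The ModalitySelector class, the field names, and the duration bookkeeping are illustrative assumptions, not the implementation evaluated in the study.

from collections import defaultdict
from statistics import mean

class ModalitySelector:
    """Hypothetical per-turn modality-selection policy: prefer the input
    mode (speech vs. visual/GUI) that has historically been faster for
    the current form field."""

    def __init__(self, default_mode="speech"):
        self.default_mode = default_mode
        # durations[(field, mode)] -> observed turn durations in seconds
        self.durations = defaultdict(list)

    def record_turn(self, field, mode, interaction_time, inactivity_time):
        # Turn duration is decomposed into interaction and inactivity time,
        # mirroring the breakdown used in the evaluation.
        self.durations[(field, mode)].append(interaction_time + inactivity_time)

    def select_mode(self, field):
        # Fall back to the default mode until both modes have been observed.
        speech = self.durations.get((field, "speech"))
        visual = self.durations.get((field, "visual"))
        if not speech or not visual:
            return self.default_mode
        return "speech" if mean(speech) <= mean(visual) else "visual"

# Example usage: after a few logged turns, the (hypothetical) departure-date
# field is opened in visual mode because GUI input has been faster on average.
selector = ModalitySelector()
selector.record_turn("departure_date", "speech", 4.1, 2.3)
selector.record_turn("departure_date", "visual", 2.0, 0.8)
selector.record_turn("departure_date", "speech", 3.8, 1.9)
selector.record_turn("departure_date", "visual", 2.2, 0.7)
print(selector.select_mode("departure_date"))  # -> "visual"

In practice such a policy could also condition on interaction context and user, since the duration statistics and empirical probability density functions reported in the study are computed per context and per user.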

[1] Eric Fosler-Lussier, et al. Information Seeking Spoken Dialogue Systems, Part II: Multimodal Dialogue, 2007, IEEE Transactions on Multimedia.

[2] Li Deng, et al. MiPad: a next generation PDA prototype, 2000, INTERSPEECH.

[3] Marilyn A. Walker, et al. MATCH: An Architecture for Multimodal Dialogue Systems, 2002, ACL.

[4] James A. Larson, et al. Guidelines for multimodal user interface design, 2004, CACM.

[5] Janienke Sturm, et al. Effects of prolonged use on the usability of a multimodal form-filling interface, 2004.

[6] Sharon L. Oviatt, et al. The efficiency of multimodal interaction: a case study, 1998, ICSLP.

[7] Alexandros Potamianos, et al. Blending speech and visual input in multimodal dialogue systems, 2006, IEEE Spoken Language Technology Workshop.

[8] Emiel Krahmer, et al. Preferred modalities in dialogue systems, 2000, INTERSPEECH.

[9] Alexander H. Waibel, et al. Multimodal interfaces, 1996, Artificial Intelligence Review.

[10] David S. Ebert, et al. The integrality of speech in multimodal interfaces, 1998, TCHI.

[11] Sherif Abdou, et al. An enhanced BLSTIP dialogue research platform, 2000, INTERSPEECH.

[12] Marilyn A. Walker, et al. PARADISE: A Framework for Evaluating Spoken Dialogue Agents, 1997, ACL.

[13] Nicole Yankelovich, et al. Conversational speech interfaces, 2002.

[14] Helen Mitchard, et al. Experimental Comparisons of Data Entry by Automated Speech Recognition, Keyboard, and Mouse, 2002, Hum. Factors.

[15] Niels Ole Bernsen, et al. Is speech the right thing for your application?, 1998, ICSLP.

[16] J. Jacko, et al. The human-computer interaction handbook: fundamentals, evolving technologies and emerging applications, 2002.

[17] A. Potamianos, Modality tracking in the multimodal Bell Labs Communicator, 2003, IEEE Workshop on Automatic Speech Recognition and Understanding.

[18] Hong-Kwang Jeff Kuo, et al. Dialogue management in the Bell Labs communicator system, 2000, INTERSPEECH.

[19] Kristinn R. Thórisson, et al. Integrating Simultaneous Input from Speech, Gaze, and Hand Gestures, 1991, AAAI Workshop on Intelligent Multimedia Interfaces.

[20] Nicole Beringer, et al. PROMISE - A Procedure for Multimodal Interactive System Evaluation, 2002.

[21] Joëlle Coutaz, et al. A design space for multimodal systems: concurrent processing and data fusion, 1993, INTERCHI.

[22] Richard A. Bolt, et al. “Put-that-there”: Voice and gesture at the graphics interface, 1980, SIGGRAPH '80.

[23] Brian F. Goldiez, et al. A Paradigm Shift in Interactive Computing: Deriving Multimodal Design Principles from Behavioral and Neurological Foundations, 2004, Int. J. Hum. Comput. Interact.

[24] Li Deng, et al. Distributed speech processing in MiPad's multimodal user interface, 2002, IEEE Trans. Speech Audio Process.

[25] Sharon Oviatt, et al. Multimodal Interfaces, 2008, Encyclopedia of Multimedia.

[26] James L. Flanagan, et al. Multimodal interaction on PDA's integrating speech and pen inputs, 2003, INTERSPEECH.

[27] Philip R. Cohen, et al. QuickSet: multimodal interaction for distributed applications, 1997, MULTIMEDIA '97.

[28] Alexander H. Waibel, et al. Multimodal error correction for speech user interfaces, 2001, TCHI.

[29] Philip R. Cohen, et al. The role of voice in human-machine communication, 1994.