A speech-centric perspective for human-computer interface

Speech technology has been playing a central role in enhancing human-machine interaction, especially for small devices, for which graphical user interfaces (GUIs) have obvious limitations. The speech-centric perspective for the human-computer interface advanced in this paper derives from the view that speech is the only natural and expressive modality that enables people to access information from, and to interact with, any device. In this paper, we describe the work conducted at Microsoft Research, in the project codenamed Dr.Who, aimed at developing enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present MiPad as the first Dr.Who application, one that specifically addresses the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the prevailing problem of pecking with tiny styluses or typing on minuscule keyboards on today's PDAs and smart phones.
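
As a purely illustrative sketch of the tap-and-talk style of interaction that MiPad supports, the toy Python program below lets a tap on a form field constrain which semantic slot an utterance fills, while a stubbed recognizer and a naive keyword-based parser stand in for the real continuous speech recognition and spoken language understanding components. All class names, function names, and the slot-filling logic are hypothetical and are not taken from the Dr.Who or MiPad implementation.

# Illustrative sketch only: a toy "tap-and-talk" interaction loop in the
# spirit of the MiPad scenario described above. The real system relies on
# distributed speech recognition and a robust semantic parser; neither is
# reproduced here. All names below are hypothetical.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Form:
    """A simple e-mail form whose fields double as semantic slots."""
    slots: Dict[str, Optional[str]] = field(
        default_factory=lambda: {"to": None, "subject": None, "body": None}
    )

def recognize_speech(audio: bytes) -> str:
    """Stand-in for a continuous speech recognizer (stub, not real ASR)."""
    return "send mail to alice subject lunch meeting"

def understand(utterance: str, tapped_slot: Optional[str]) -> Dict[str, str]:
    """Toy spoken language understanding: if the user tapped a field, the
    whole utterance fills that slot; otherwise fall back to naive
    keyword-based slot filling over the recognized word string."""
    if tapped_slot is not None:
        return {tapped_slot: utterance}
    parsed: Dict[str, str] = {}
    tokens = utterance.split()
    if "to" in tokens:
        parsed["to"] = tokens[tokens.index("to") + 1]
    if "subject" in tokens:
        parsed["subject"] = " ".join(tokens[tokens.index("subject") + 1:])
    return parsed

def tap_and_talk(form: Form, tapped_slot: Optional[str], audio: bytes) -> Form:
    """One interaction turn: the tap constrains the context, speech supplies
    the content, and the understood slots update the form."""
    utterance = recognize_speech(audio)
    for slot, value in understand(utterance, tapped_slot).items():
        if slot in form.slots:
            form.slots[slot] = value
    return form

if __name__ == "__main__":
    form = Form()
    # Turn 1: the user speaks a full command without tapping any field.
    form = tap_and_talk(form, tapped_slot=None, audio=b"...")
    # Turn 2: the user taps the "subject" field and dictates into it.
    form = tap_and_talk(form, tapped_slot="subject", audio=b"...")
    print(form.slots)

The design point the sketch tries to capture is that the tap does not replace speech; it narrows the recognition and understanding problem to a single field, which is one way a multimodal interface like MiPad can keep error rates manageable on a small mobile device.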
