Bimodal combination of speech and handwriting for improved word recognition

This paper presents a multimodal interface that combines speech and handwriting for isolated word recognition. Automatic speech recognition accuracy decreases as the perplexity of the task grows with vocabulary size and noise level, and combining different input modalities can improve recognition performance. Handwriting is a natural modality and can replace a keyboard on small portable devices such as Tablet PCs or PDAs; on its own, however, it can be quite slow. The method proposed in this paper combines the two modalities by using handwriting to enter only the first letters of a word and speech to complete it. The combination was tested on a Tablet PC, using the handwriting recognition engine integrated in Windows XP Tablet PC Edition. Experiments were based on a vocabulary of 35,000 words, and relative word recognition improvements as high as 53% were obtained.
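The core idea, using the handwritten prefix to constrain which speech hypotheses are admissible, can be sketched as follows. This is a minimal illustration, not the paper's actual fusion algorithm; the function and score representation are hypothetical, assuming the speech recognizer returns an n-best list of `(word, score)` pairs.

```python
def combine(prefix, speech_hypotheses, vocabulary):
    """Pick the best speech hypothesis consistent with the handwritten prefix.

    prefix: first letters of the word, as recognized from handwriting.
    speech_hypotheses: list of (word, score) pairs, higher score = more likely.
    vocabulary: the full word list (e.g. 35,000 entries in the paper's setup).
    All names here are illustrative, not from the paper.
    """
    # Restrict the search space to words starting with the handwritten prefix.
    candidates = {w for w in vocabulary if w.startswith(prefix)}
    # Keep only speech hypotheses that survive the prefix constraint.
    ranked = [(w, s) for w, s in speech_hypotheses if w in candidates]
    if ranked:
        return max(ranked, key=lambda ws: ws[1])[0]
    # Fall back to any prefix-matching vocabulary word if speech fails entirely.
    return next(iter(sorted(candidates)), None)

vocab = ["recognition", "recognize", "record", "speech"]
hyps = [("record", 0.4), ("recognition", 0.9), ("speech", 0.7)]
print(combine("rec", hyps, vocab))  # -> recognition
```

The prefix constraint sharply reduces the effective perplexity seen by the speech recognizer, which is the intuition behind the reported accuracy gains.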
