Haptic Voice Recognition: Augmenting speech modality with touch events for efficient speech recognition

This paper proposes Haptic Voice Recognition (HVR), a multi-modal interface that combines speech and touch inputs to perform voice recognition. The touch inputs form a series of haptic events that serve as cues, or ‘landmarks’, for word boundaries. These word boundary cues greatly reduce the search space for speech recognition, making the decoding process more efficient and better suited to portable devices with limited compute and memory resources. Knowledge of the word boundaries also suppresses insertion and deletion errors, which is particularly helpful when recognition is performed in noisy environments. A series of experiments was conducted to study the feasibility of augmenting automatic speech recognition with touch events and to explore the potential benefits. The experiments used synthetically simulated haptic events on the Wall Street Journal database, as well as realistic haptic events acquired with a prototype HVR interface implemented on a touchscreen device.
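To make the search-space argument concrete, the following is a minimal sketch of how known word boundaries prune a Viterbi-style search. It is not the decoder used in the paper: it assumes a toy model with one state per whole word, hand-made per-frame log-scores, and a fixed word-transition penalty. The function name `decode` and its parameters are hypothetical, introduced only for illustration.

```python
def decode(frame_scores, boundaries=None, word_penalty=-1.0):
    """Toy Viterbi search over whole-word states (illustrative assumption).

    frame_scores[t][w] is the log-score of frame t under word w.
    boundaries is a set of frame indices at which a new word must start;
    None means unconstrained (a new word may start at any frame).
    Returns (best word sequence, number of transitions evaluated).
    """
    num_frames = len(frame_scores)
    num_words = len(frame_scores[0])
    score = list(frame_scores[0])
    back = []  # back[t - 1][w] = (previous word, whether a new word started at t)
    evaluated = 0

    for t in range(1, num_frames):
        at_boundary = boundaries is not None and t in boundaries
        may_switch = boundaries is None or at_boundary
        new_score, pointers = [], []
        for w in range(num_words):
            candidates = []
            if not at_boundary:
                # Stay inside the current word.
                candidates.append((score[w], w, False))
            if may_switch:
                # Start a new word, coming from any previous word.
                for p in range(num_words):
                    candidates.append((score[p] + word_penalty, p, True))
            evaluated += len(candidates)
            best = max(candidates, key=lambda c: c[0])
            new_score.append(best[0] + frame_scores[t][w])
            pointers.append((best[1], best[2]))
        score = new_score
        back.append(pointers)

    # Trace back the best path, recording a word each time one started.
    w = max(range(num_words), key=lambda i: score[i])
    words = [w]
    for t in range(num_frames - 1, 0, -1):
        prev, started_new = back[t - 1][w]
        if started_new:
            words.append(prev)
        w = prev
    words.reverse()
    return words, evaluated


# Six frames, two words: frames 0-2 favour word 0, frames 3-5 favour word 1.
toy_scores = [[0.0, -5.0]] * 3 + [[-5.0, 0.0]] * 3
free_words, free_cost = decode(toy_scores)
cued_words, cued_cost = decode(toy_scores, boundaries={3})
print(free_words, free_cost)
print(cued_words, cued_cost)
```

In this toy setting both searches recover the same word sequence, but the boundary-constrained search evaluates fewer transitions because word-to-word transitions are considered only at the cued frame. The constraint also fixes the number of words, which is one way to see why boundary knowledge would suppress insertions and deletions.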
