HMM modeling for audio-visual speech recognition

Bimodal (audio-visual) speech recognition is a robust technique for automated speech analysis and has received considerable attention over the last few decades. In this paper, we analyze the effect of the choice of HMM model on the performance of a bimodal speech recognizer, present a comparative analysis of the different HMM models that can be used in bimodal speech recognition, and propose a novel model that has been experimentally verified to outperform the others. A unique characteristic of our HMM model is its fusion strategy for the acoustic and visual features, which takes into account the different sampling rates of the two signals. Compared to audio-only recognition, the bimodal scheme achieves substantially improved recognition accuracy, especially in the presence of noise.
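
The abstract does not spell out the fusion mechanics, but the sampling-rate mismatch it refers to is concrete: acoustic features such as MFCCs are typically extracted at around 100 frames per second, while video runs at roughly 25-30 frames per second. Below is a minimal sketch of one common way to reconcile the two rates, upsampling the visual stream to the audio frame rate before frame-wise concatenation; the function name, feature dimensions, and rates are illustrative assumptions, not the authors' actual method.

```python
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats):
    """Illustrative feature-level fusion of audio and visual streams.

    audio_feats:  (T_a, D_a) array, e.g. MFCCs at ~100 frames/s.
    visual_feats: (T_v, D_v) array, e.g. lip features at ~25-30 frames/s.

    The visual stream is linearly interpolated onto the audio time
    axis so the two streams can be concatenated frame by frame.
    """
    t_a = audio_feats.shape[0]
    t_v = visual_feats.shape[0]
    # Normalize both time axes to [0, 1] for interpolation.
    audio_times = np.linspace(0.0, 1.0, t_a)
    visual_times = np.linspace(0.0, 1.0, t_v)
    # Interpolate each visual feature dimension at the audio frame times.
    visual_up = np.stack(
        [np.interp(audio_times, visual_times, visual_feats[:, d])
         for d in range(visual_feats.shape[1])],
        axis=1,
    )
    # One concatenated observation vector per audio frame, usable as
    # the emission input of a single-stream HMM.
    return np.concatenate([audio_feats, visual_up], axis=1)

# Example: 100 audio frames of 13-d MFCCs, 30 video frames of 6-d lip features.
audio = np.random.randn(100, 13)
video = np.random.randn(30, 6)
fused = fuse_audio_visual(audio, video)  # shape (100, 19)
```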