Audio-visual modeling for bimodal speech recognition

Audio-visual speech recognition extends acoustic speech recognition and has received considerable attention over the last few decades. The main motivation for bimodal speech recognition is the bimodal nature of human speech perception and production. This work analyzes the effect of the modeling parameters of hidden Markov models (HMMs) on the recognition accuracy of a bimodal speech recognizer, presents a comparative analysis of the different HMM structures that can be used for bimodal speech recognition, and proposes a model that is experimentally verified to outperform the others. In addition, geometric visual features are compared and analyzed for their importance in bimodal speech recognition. A distinguishing characteristic of the system is its fusion strategy for the acoustic and visual features, which takes into account the different sampling rates of the two signals; a sketch of such rate-aligned fusion is given below. Compared with acoustic-only recognition, the audio-visual scheme achieves markedly higher recognition accuracy, especially in the presence of noise.
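As an illustration of the fusion idea (and only as an illustration), the Python sketch below aligns visual features to the acoustic frame rate by linear interpolation and concatenates the two streams into joint observation vectors suitable for an HMM. The frame rates (100 acoustic frames/s, 25 visual frames/s), the feature dimensions, and the choice of linear interpolation are assumptions made for the example, not details taken from the paper.

```python
# Hypothetical sketch of audio-visual feature fusion with rate alignment.
# Assumes acoustic features (e.g., MFCCs) at 100 frames/s and geometric
# visual features (e.g., lip width/height) at 25 frames/s; the linear
# interpolation used here is an illustrative choice, not the paper's
# exact fusion strategy.
import numpy as np

def fuse_audio_visual(audio_feats, visual_feats, audio_rate=100.0, visual_rate=25.0):
    """Upsample visual features to the acoustic frame rate and concatenate.

    audio_feats : (T_a, D_a) array of acoustic feature vectors
    visual_feats: (T_v, D_v) array of visual feature vectors
    Returns a (T_a, D_a + D_v) array of joint observation vectors for an HMM.
    """
    t_audio = np.arange(audio_feats.shape[0]) / audio_rate
    t_visual = np.arange(visual_feats.shape[0]) / visual_rate

    # Interpolate each visual feature dimension onto the acoustic time axis.
    visual_upsampled = np.column_stack([
        np.interp(t_audio, t_visual, visual_feats[:, d])
        for d in range(visual_feats.shape[1])
    ])
    return np.hstack([audio_feats, visual_upsampled])

# Example: 1 second of speech -> 100 MFCC frames (13-dim), 25 visual frames
# (4 geometric lip features).
audio = np.random.randn(100, 13)
visual = np.random.randn(25, 4)
joint = fuse_audio_visual(audio, visual)
print(joint.shape)  # (100, 17)
```

The key design point the example tries to capture is that the visual stream is brought up to the acoustic frame rate before concatenation, so that each joint observation vector covers the same time instant in both modalities.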
