Automatic speech recognition system using acoustic and visual signals

Automatic speech-reading systems use both acoustic and visual signals to perform speech recognition. In previous work, we showed how visual speech information can improve the accuracy of automatic speech recognition and described an algorithm based on deformable templates that accurately infers lip dynamics. In this paper we present a complete speech-reading system that records an utterance using a standard color video camera, preprocesses both the audio and video signals, and performs speech recognition. The system is based on new algorithms for locating the talker's face and mouth and an improved template algorithm for tracking the lips. We also compare the results of the new system with our previous work and discuss strategies for integrating the two modalities.
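As a concrete illustration of one common integration strategy, and not necessarily the one adopted in the system described here, the following minimal Python sketch shows weighted late fusion, in which per-class log-likelihoods from separately trained acoustic and visual recognizers are combined with a fixed weight. The function name `fuse_late`, the weight `lambda_a`, and the example scores are all hypothetical.

```python
import numpy as np

# Hypothetical late-fusion combiner: the weight lambda_a trades off trust in
# the acoustic recognizer against the visual one. The paper compares several
# integration strategies; this fixed-weight scheme is just one illustration.
def fuse_late(acoustic_loglik: np.ndarray,
              visual_loglik: np.ndarray,
              lambda_a: float = 0.7) -> np.ndarray:
    """Combine per-class log-likelihoods from the two modalities."""
    return lambda_a * acoustic_loglik + (1.0 - lambda_a) * visual_loglik

# Example: three candidate words scored by each modality (made-up numbers).
acoustic = np.log(np.array([0.6, 0.3, 0.1]))  # acoustic recognizer scores
visual   = np.log(np.array([0.2, 0.5, 0.3]))  # visual recognizer scores
combined = fuse_late(acoustic, visual)
print("recognized class:", int(np.argmax(combined)))
```

A design note on this style of fusion: because the modalities are combined only at the decision level, each recognizer can be trained and tuned independently, which is one reason late integration is often contrasted with early (feature-level) integration when the two strategies are compared.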