Adaptive bimodal sensor fusion for automatic speechreading

We present work on improving the performance of automated speech recognizers by using additional visual information (lip-/speechreading), achieving error reductions of up to 50%. This paper focuses on different methods of combining the visual and acoustic data to improve recognition performance. We demonstrate these methods on an extension of an existing state-of-the-art speech recognition system, a modular MS-TDNN. We have developed adaptive combination methods at several levels of the recognition network. Additional information, such as an estimated signal-to-noise ratio (SNR), is used in some cases. Results for the different combination methods are shown for clean speech and for data with artificial noise (white, music, motor). The new combination methods adapt automatically to varying noise conditions, making hand-tuned parameters unnecessary.
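
As an illustration of what an SNR-driven combination might look like, the following minimal sketch weights per-class acoustic and visual scores by an estimated SNR. It is not the method of the paper: the logistic mapping, the function names (acoustic_weight, fuse) and the parameters (midpoint_db, slope) are all assumptions chosen for illustration only.

```python
import math


def acoustic_weight(snr_db, midpoint_db=15.0, slope=0.2):
    """Map an estimated SNR in dB to an acoustic weight in (0, 1).

    Hypothetical logistic mapping: high SNR favours the acoustic
    stream, low SNR shifts weight to the visual stream.
    """
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse(acoustic, visual, snr_db):
    """Combine per-class scores of both streams as a convex mixture."""
    w = acoustic_weight(snr_db)
    return [w * a + (1.0 - w) * v for a, v in zip(acoustic, visual)]


if __name__ == "__main__":
    # Example per-class scores (made-up values for two classes).
    acoustic_scores = [0.9, 0.1]
    visual_scores = [0.4, 0.6]
    for snr in (0.0, 15.0, 30.0):
        print(snr, fuse(acoustic_scores, visual_scores, snr))
```

Because the weight is computed from the SNR estimate at run time, such a scheme needs no per-condition hand tuning of the mixture, though the mapping itself would have to be chosen or learned.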
