Computer lipreading for improved accuracy in automatic speech recognition

Among the various methods that have been proposed to improve the robustness and accuracy of automatic speech recognition (ASR) systems, lipreading has received little attention until very recently. However, results from the psychological literature indicate that lipreading, in conjunction with auditory perception, can substantially improve speech recognition and understanding in humans. We have developed a novel speaker-dependent lipreading system that uses hidden Markov models. An audiovisual system known as Lipreading to Enhance Automatic Perception of Speech (LEAPS) is described, in which the lipreading system is used in conjunction with an audio ASR system to improve the accuracy of the latter, especially under degraded acoustic conditions. Experimental results are presented for two small phoneme discrimination tasks, as well as a medium-vocabulary isolated-word recognition task. In all cases, performance of the combined system is superior to that of the audio system, with error reductions ranging from 20% to 65%.
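
The abstract does not specify how the audio and visual HMM scores are integrated. As one illustration of the general approach described, the sketch below shows a simple weighted log-likelihood (late) fusion of per-word scores from separately trained audio and visual recognizers, where more weight shifts toward the visual stream as acoustic conditions degrade. The function name `fuse_scores`, the weight `lam`, and all score values are hypothetical and not taken from the paper.

```python
import numpy as np

def fuse_scores(audio_loglik, visual_loglik, lam):
    """Weighted log-likelihood (late) fusion of two modalities.

    lam in [0, 1] weights the audio stream; (1 - lam) weights the
    visual stream. Under clean acoustics lam would be near 1; as
    noise increases, more weight shifts to the lipreading scores.
    (Illustrative scheme only; the paper's method may differ.)
    """
    return lam * np.asarray(audio_loglik) + (1.0 - lam) * np.asarray(visual_loglik)

# Hypothetical log-likelihoods for three candidate words,
# one score per word from each modality's HMMs.
audio = [-120.0, -118.5, -119.2]   # audio HMM scores (noisy)
visual = [-45.0, -52.3, -44.1]     # visual HMM scores

for lam in (1.0, 0.7, 0.3):        # decreasing trust in the audio stream
    fused = fuse_scores(audio, visual, lam)
    print(f"lam={lam}: best word index = {int(np.argmax(fused))}")
```

With lam = 1.0 the decision is driven entirely by the (possibly noise-corrupted) audio scores; lowering lam lets the visual evidence override them, which is the mechanism by which an audiovisual system can outperform audio alone under degraded conditions.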
