Audiovisual speech recognition using multiscale nonlinear image decomposition

There has recently been increasing interest in enhancing speech recognition with visual information derived from the talker's face. This paper demonstrates the use of nonlinear image decomposition, in the form of a "sieve", applied to the task of visual speech recognition. Information derived from the mouth region is used for visual and audio-visual speech recognition on a database of the letters A-Z spoken by four talkers. For each frame, a scale histogram is generated directly from the gray-scale pixels of a window containing the talker's mouth. Results are presented for visual-only, audio-only and a simple audio-visual case.
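To make the idea of a scale histogram concrete, the following is a minimal one-dimensional sketch and not the paper's implementation. It assumes the sieve can be approximated by a cascade of flat opening-then-closing filters of increasing width, that a "granule" is whatever each stage removes, and that the histogram simply counts granules per scale. The function names (`opening`, `closing`, `count_runs`, `scale_histogram`) and the choice of filter are illustrative assumptions; the abstract does not specify which sieve variant or which dimensionality is used on the mouth window.

```python
def opening(signal, k):
    """Flat 1D opening: removes bright features narrower than k samples."""
    n = len(signal)
    if k > n:
        raise ValueError("window length k must not exceed the signal length")
    # Minimum over every length-k window that lies fully inside the signal.
    mins = [min(signal[i:i + k]) for i in range(n - k + 1)]
    # Each sample takes the largest of the minima of the windows covering it.
    return [max(mins[max(0, x - k + 1):min(x, n - k) + 1]) for x in range(n)]


def closing(signal, k):
    """Flat 1D closing, obtained by duality: removes dark features narrower than k."""
    return [-v for v in opening([-v for v in signal], k)]


def count_runs(values):
    """Count maximal runs of nonzero samples (a rough proxy for granule count)."""
    runs, inside = 0, False
    for v in values:
        nonzero = v != 0
        if nonzero and not inside:
            runs += 1
        inside = nonzero
    return runs


def scale_histogram(signal, max_scale):
    """Assumed sieve-like cascade: at scale s, remove extrema of width <= s.

    Returns a list whose entry s-1 is the number of granules found at scale s.
    """
    current = list(signal)
    histogram = []
    for s in range(1, max_scale + 1):
        filtered = closing(opening(current, s + 1), s + 1)
        granules = [a - b for a, b in zip(current, filtered)]
        histogram.append(count_runs(granules))
        current = filtered  # the next stage works on the already simplified signal
    return histogram


# Hypothetical row of gray-scale pixels: one bump of width 1 and one of width 2.
row = [0, 0, 5, 0, 0, 3, 3, 0, 0, 0]
print(scale_histogram(row, 3))
```

In this toy signal the width-1 bump is removed at scale 1 and the width-2 bump at scale 2, so each contributes one granule to the corresponding histogram bin. Applied to every row or column of a mouth window and accumulated per frame, a vector of this kind could serve as a visual feature, though how the paper builds and uses its histogram may differ.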
