Audio-Visual Speech Modeling for Continuous Speech Recognition

This paper describes a speech recognition system that uses both acoustic and visual speech information to improve recognition performance in noisy environments. The system consists of three components: a visual module, an acoustic module, and a sensor fusion module. The visual module locates and tracks the speaker's lip movements and extracts relevant speech features. This task is performed with an appearance-based lip model that is learned from example images. Visual speech features are represented by contour information of the lips and by grey-level information of the mouth area. The acoustic module extracts noise-robust features from the audio signal. Finally, the sensor fusion module is responsible for the joint temporal modeling of the acoustic and visual feature streams and is realized using multistream hidden Markov models (HMMs). The multistream method allows different temporal topologies and levels of stream integration to be defined and hence models temporal dependencies between the modalities more accurately than traditional approaches. We present two methods for learning the asynchrony between the two modalities and for incorporating it into the multistream models. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits. On a recognition task at 15 dB acoustic signal-to-noise ratio (SNR), acoustic perceptual linear prediction (PLP) features lead to a 56% error rate, noise-robust RASTA-PLP (relative spectra) acoustic features to a 7.2% error rate, and combined noise-robust acoustic and visual features to a 2.5% error rate.
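To make the fusion step concrete, the sketch below illustrates the kind of multistream score combination the abstract refers to, at two levels of integration: state-synchronous combination of per-stream likelihoods, and recombination of independently decoded stream scores at anchor points. This is a minimal illustration of the general multistream-HMM technique, not the paper's exact formulation; the function names and the example stream weights are assumptions introduced here.

```python
import numpy as np

def state_log_likelihood(log_b_a: float, log_b_v: float,
                         lam_a: float = 0.7, lam_v: float = 0.3) -> float:
    """State-synchronous integration: both streams share one HMM state,
    and per-stream log-likelihoods are combined log-linearly with stream
    exponents:
        log b_j(o_a, o_v) = lam_a * log b_j^a(o_a) + lam_v * log b_j^v(o_v)
    """
    return lam_a * log_b_a + lam_v * log_b_v

def anchor_recombination(unit_score_a: float, unit_score_v: float,
                         lam_a: float = 0.7, lam_v: float = 0.3) -> float:
    """Asynchronous integration: each stream follows its own state path
    within a unit (e.g. a phone or word), and the best per-stream path
    scores are recombined only at the unit boundary (anchor point), so
    the two streams may be temporally out of step inside the unit."""
    return lam_a * unit_score_a + lam_v * unit_score_v

# Example: combine per-frame state log-likelihoods for the two streams.
log_b_acoustic = -42.3  # hypothetical acoustic-stream state log-likelihood
log_b_visual = -15.8    # hypothetical visual-stream state log-likelihood
print(state_log_likelihood(log_b_acoustic, log_b_visual))
```

The stream exponents encode the relative reliability of each modality; in practice they are often tied to the acoustic SNR or learned from data, and the choice of anchor unit determines how much asynchrony between the audio and video streams the model can absorb.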
