Continuous Audio-Visual Speech Recognition

We address the problem of robust lip tracking, visual speech feature extraction, and sensor integration for audio-visual speech recognition applications. An appearance-based model of the articulators, which represents linguistically important features, is learned from example images and is used to locate, track, and recover visual speech information. We tackle the problem of joint temporal modelling of the acoustic and visual speech signals by applying Multi-Stream hidden Markov models. This approach allows the use of different temporal topologies and levels of stream integration, and hence enables temporal dependencies to be modelled more accurately. The system has been evaluated on a continuously spoken digit recognition task with 37 subjects.
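A common way to realise the stream integration described above is to score each HMM state as a weighted combination of per-stream log-likelihoods, with the weights reflecting the relative reliability of the acoustic and visual channels. The following is a minimal sketch of that emission scoring under simplifying assumptions: scalar observations, single-Gaussian emission densities per stream, and illustrative stream weights (0.7 audio, 0.3 visual) chosen here for demonstration only, not taken from the paper.

```python
import math

def gaussian_loglik(x, mean, var):
    # Log-density of a scalar observation under a univariate Gaussian.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_loglik(obs, params, weights):
    # Multi-stream emission score for one HMM state:
    #   log b_j(o) = sum_s w_s * log b_{j,s}(o_s)
    # obs:     one observation per stream (e.g. [audio_feat, visual_feat])
    # params:  (mean, var) of the state's emission density per stream
    # weights: per-stream reliability exponents (assumed, not learned here)
    return sum(w * gaussian_loglik(o, m, v)
               for o, (m, v), w in zip(obs, params, weights))

# Hypothetical example: audio stream weighted 0.7, visual stream 0.3.
score = multistream_loglik(obs=[0.2, -0.1],
                           params=[(0.0, 1.0), (0.0, 0.5)],
                           weights=[0.7, 0.3])
```

Setting a stream's weight to zero discards that channel entirely, while equal weights reduce to a plain product of the two stream likelihoods; this per-state weighting is what lets a multi-stream system down-weight the acoustic channel in noise.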
