Recognition of visual speech elements using adaptively boosted hidden Markov models

The performance of an automatic speech recognition (ASR) system can be significantly enhanced with additional information from visual speech elements such as the movements of the lips, tongue, and teeth, especially in noisy environments. In this paper, a novel approach for the recognition of visual speech elements is presented. The approach makes use of adaptive boosting (AdaBoost) and hidden Markov models (HMMs) to build an AdaBoost-HMM classifier. The composite HMMs of the AdaBoost-HMM classifier are trained to cover different groups of training samples using the AdaBoost technique and a biased Baum-Welch training method. By combining the decisions of the component classifiers of the composite HMMs according to a novel probability synthesis rule, a more complex decision boundary is formed than that of a single-HMM classifier. The method is applied to the recognition of the basic visual speech elements. Experimental results show that the AdaBoost-HMM classifier outperforms the traditional HMM classifier in accuracy, especially for visemes extracted from contexts.
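The boosting scheme described above can be illustrated with a minimal sketch. This is not the paper's implementation: the biased Baum-Welch-trained HMM components are replaced here by a hypothetical stand-in weak learner (a weighted nearest-class-mean classifier on 1-D features), and only the generic AdaBoost reweighting and weighted decision combination are shown.

```python
import math

# Hedged sketch of AdaBoost over weak classifiers. A weighted
# nearest-class-mean scorer stands in for the paper's biased
# Baum-Welch-trained HMMs; all names here are illustrative.

def train_weak(samples, labels, weights):
    """Fit one weighted mean per class; classify by nearest mean."""
    means = {}
    for c in (0, 1):
        num = sum(w * x for x, y, w in zip(samples, labels, weights) if y == c)
        den = sum(w for y, w in zip(labels, weights) if y == c)
        means[c] = num / den
    return lambda x: min(means, key=lambda c: abs(x - means[c]))

def adaboost(samples, labels, rounds=5):
    n = len(samples)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        clf = train_weak(samples, labels, weights)
        # Weighted training error of the current component classifier.
        err = sum(w for x, y, w in zip(samples, labels, weights) if clf(x) != y)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, clf))
        # Emphasize the samples this classifier got wrong, so the next
        # component covers a different group of training samples.
        weights = [w * math.exp(alpha if clf(x) != y else -alpha)
                   for x, y, w in zip(samples, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def predict(x):
        # Combine component decisions by their confidence weights alpha.
        score = sum(a if c(x) == 1 else -a for a, c in ensemble)
        return 1 if score > 0 else 0
    return predict

# Toy usage: two classes clustered near 0 and near 3.
predict = adaboost([0.1, -0.2, 0.3, 2.8, 3.1, 2.9], [0, 0, 0, 1, 1, 1])
```

In the paper's setting, each component would instead be an HMM scored by log-likelihood, and the weighted combination would follow the proposed probability synthesis rule rather than a simple signed vote.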
