A Low-Complexity Parabolic Lip Contour Model With Speaker Normalization for High-Level Feature Extraction in Noise-Robust Audiovisual Speech Recognition

This paper proposes a novel low-complexity lip contour model for high-level optic feature extraction in noise-robust audiovisual (AV) automatic speech recognition systems. The model is based on weighted least-squares parabolic fitting of the upper and lower lip contours; because it does not assume symmetry across the horizontal axis of the mouth, it captures lip shapes more realistically. Unlike other high-level models, it does not depend on the accurate estimation of specific facial points. We also present a novel low-complexity algorithm for speaker normalization of the optic information stream, which is compatible with the proposed model and requires no parameter training. Using the proposed model with speaker normalization improves noise robustness in AV isolated-word recognition relative to the baseline high-level model.
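As a rough illustration of the fitting step only, the sketch below fits independent parabolas to synthetic upper- and lower-lip contour points by weighted least squares. The function name, weighting scheme, and test data are illustrative assumptions, not the paper's implementation; fitting the two contours separately is what avoids forcing symmetry about the horizontal axis of the mouth.

```python
# Minimal weighted least-squares parabolic fit for one lip contour, assuming
# (x, y) pixel coordinates of detected contour points and per-point weights w
# (e.g., edge confidence). All names and data here are hypothetical.
import numpy as np

def fit_parabola_wls(x, y, w):
    """Fit y = a*x**2 + b*x + c minimizing sum_i w_i * (y_i - y_hat_i)**2.

    Scaling each row of the design matrix and each target by sqrt(w_i)
    turns the weighted problem into an ordinary least-squares solve.
    """
    A = np.column_stack([x**2, x, np.ones_like(x)])  # rows [x_i^2, x_i, 1]
    sw = np.sqrt(w)
    coeffs, *_ = np.linalg.lstsq(A * sw[:, None], sw * y, rcond=None)
    return coeffs  # (a, b, c)

# Synthetic contours: upper and lower lips are fitted independently,
# so no symmetry across the mouth's horizontal axis is imposed.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
upper = -0.6 * x**2 + 0.5 + 0.02 * rng.standard_normal(x.size)
lower = 0.4 * x**2 - 0.5 + 0.02 * rng.standard_normal(x.size)
w = np.ones_like(x)  # uniform weights for this demo

a_u, b_u, c_u = fit_parabola_wls(x, upper, w)
a_l, b_l, c_l = fit_parabola_wls(x, lower, w)
print("upper lip:", a_u, b_u, c_u)
print("lower lip:", a_l, b_l, c_l)
```

The three coefficients per contour give a compact, low-complexity shape description that does not hinge on locating individual facial landmark points; the abstract does not specify the speaker-normalization algorithm, so it is not sketched here.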
