Effective lip localization and tracking for achieving multimodal speech recognition

Effective fusion of acoustic and visual modalities in speech recognition has been an important issue in human-computer interfaces, as the visual modality can improve both intelligibility and robustness. Speaker lip motion stands out as the most linguistically relevant visual feature for speech recognition. In this paper, we present a new hybrid approach to lip localization and tracking, aimed at improving speech recognition in noisy environments. The approach begins with a new color space transformation that enhances lip segmentation: a PCA-based method derives a one-dimensional color space that maximizes the discrimination between lip and non-lip colors, and intensity information is incorporated to improve the contrast of the upper lip and the lip corners. In the subsequent step, a constrained yet highly flexible deformable lip model is constructed to accurately capture and track lip shapes. The model requires only six degrees of freedom, yet provides a precise description of lip shapes using a simple least-squares fitting method. Experimental results indicate that the proposed hybrid approach delivers reliable and accurate localization and tracking of lip motion under various measurement conditions.
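
The following is a minimal sketch of a PCA-based one-dimensional color projection of the kind described above. It assumes (beyond what the abstract states) that pixels are RGB triples and that labelled lip and non-lip color samples are available for training; the projection axis here is chosen as the principal direction along which the two class means are farthest apart, which is a hedged stand-in for the paper's exact criterion.

```python
import numpy as np

def learn_color_projection(lip_pixels, nonlip_pixels):
    """Learn a 1-D color axis separating lip from non-lip colors.

    lip_pixels, nonlip_pixels: (N, 3) arrays of RGB samples.
    Returns a unit 3-vector w; projecting a pixel onto w gives its
    one-dimensional "lipness" value.
    """
    # Stack all samples and remove the overall mean before PCA.
    X = np.vstack([lip_pixels, nonlip_pixels]).astype(np.float64)
    X -= X.mean(axis=0)

    # Principal axes of the combined color distribution (rows of Vt).
    _, _, Vt = np.linalg.svd(X, full_matrices=False)

    # Choose the principal axis along which the two class means are
    # farthest apart (an assumption, not the paper's exact construction).
    mean_gap = lip_pixels.mean(axis=0) - nonlip_pixels.mean(axis=0)
    best = np.argmax(np.abs(Vt @ mean_gap))
    w = Vt[best]
    return w / np.linalg.norm(w)

def lipness_map(image_rgb, w):
    """Project an (H, W, 3) RGB image onto the learned 1-D color space."""
    return image_rgb.astype(np.float64) @ w
```

As a companion illustration of least-squares contour fitting, the sketch below fits a parabolic arc to extracted lip boundary points. This is only a simplified stand-in: the paper's six-parameter deformable model is more refined, but the fitting step it describes is of this same linear least-squares form.

```python
import numpy as np

def fit_parabola(points_xy):
    """Least-squares fit of y = a*x^2 + b*x + c to lip boundary points.

    points_xy: (N, 2) array of (x, y) boundary coordinates.
    Returns the coefficients (a, b, c).
    """
    x, y = points_xy[:, 0], points_xy[:, 1]
    A = np.stack([x ** 2, x, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs
```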
