A comparative study of 2D and 3D lip tracking methods for AV ASR

Over the past two decades, many algorithms have been proposed to detect and track a human face and its facial features. Of particular interest to the Automatic Speech Recognition (ASR) community are algorithms that can track the shape of the lips, because such visual speech input can be used in an auditory-visual (AV) ASR system to improve on the recognition accuracy of traditional audio-only ASR systems, particularly in the presence of acoustic noise. Despite the large number of face and lip tracking algorithms proposed over the years, no comparative study has evaluated such algorithms in the context of AV ASR performance. In this paper, the performance of various 2D and 3D lip tracking algorithms is compared from the point of view of AV ASR. The study focuses on algorithms that use explicit lip models. A number of variants of the recently popular Active Appearance Models (AAMs) are compared with a 3D lip tracking algorithm that uses stereo vision. All performance evaluations are made using the AVOZES data corpus.

Index Terms: lip tracking, auditory-visual automatic speech recognition, active appearance model
