Multipose audio-visual speech recognition

In this paper, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on the effects of changes in the speaker's pose relative to the camera, a problem encountered in natural situations. To that end, we introduce a pose normalization technique and perform speech recognition from multiple views by generating virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition studies and relies on linear regression to find an approximate mapping between images from different poses. Lipreading experiments quantify the loss of performance caused by pose changes and the effect of the proposed pose normalization technique, while audio-visual experiments analyse how an audio-visual system should account for non-frontal poses through the weight assigned to the visual modality in the audio-visual classifier.
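
As an illustration of the linear-regression idea mentioned above, the following minimal sketch fits a linear map from non-frontal feature vectors to their frontal counterparts and applies it to produce "virtual frontal" features. It is not the method of the paper: the function names (`fit_pose_mapping`, `to_virtual_frontal`, `make_synthetic_pairs`), the ridge penalty, the added bias term, and the synthetic data are all assumptions made for this example, and real use would need paired mouth-region images or features of the same utterances seen from both poses.

```python
import numpy as np


def fit_pose_mapping(X_side, X_front, ridge=1e-3):
    """Fit W so that [X_side, 1] @ W approximates X_front.

    X_side, X_front: arrays of shape (n_samples, n_features) holding paired
    non-frontal and frontal feature vectors (hypothetical inputs).
    ridge: small L2 penalty, an assumption added here for numerical stability.
    """
    n = X_side.shape[0]
    A = np.hstack([X_side, np.ones((n, 1))])  # append a bias column
    reg = ridge * np.eye(A.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unpenalized
    # Regularized normal equations: (A^T A + reg) W = A^T X_front
    return np.linalg.solve(A.T @ A + reg, A.T @ X_front)


def to_virtual_frontal(X_side, W):
    """Map non-frontal feature vectors to virtual frontal ones."""
    A = np.hstack([X_side, np.ones((X_side.shape[0], 1))])
    return A @ W


def make_synthetic_pairs(n=200, d=16, seed=0):
    """Synthetic stand-in for paired views: side = front @ M + b + noise."""
    rng = np.random.default_rng(seed)
    X_front = rng.normal(size=(n, d))
    M = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # mild linear distortion
    b = rng.normal(size=d)
    X_side = X_front @ M + b + 0.01 * rng.normal(size=(n, d))
    return X_side, X_front


if __name__ == "__main__":
    X_side, X_front = make_synthetic_pairs()
    # Fit on the first 150 pairs, evaluate on the remaining 50.
    W = fit_pose_mapping(X_side[:150], X_front[:150])
    virtual = to_virtual_frontal(X_side[150:], W)
    err_raw = np.mean((X_side[150:] - X_front[150:]) ** 2)
    err_norm = np.mean((virtual - X_front[150:]) ** 2)
    print(f"MSE without normalization: {err_raw:.4f}")
    print(f"MSE with virtual frontal views: {err_norm:.4f}")
```

In this toy setting the distortion between views is linear by construction, so the fitted map recovers it almost exactly; with real images the mapping is only approximate, as the abstract notes.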
