Multi-pose lipreading and audio-visual speech recognition

In this article, we study the adaptation of visual and audio-visual speech recognition systems to non-ideal visual conditions. We focus on overcoming the effects of changes in the speaker's pose, a problem encountered in natural situations where the speaker moves freely and does not keep a frontal pose relative to the camera. To handle these situations, we introduce a pose normalization block into a standard system and generate virtual frontal views from non-frontal images. The proposed method is inspired by pose-invariant face recognition and relies on linear regression to find an approximate mapping between images from different poses. We integrate the pose normalization block at different stages of the speech recognition system and quantify the loss of performance caused by pose changes and by the pose normalization techniques themselves. In audio-visual experiments, we also analyze the integration of the audio and visual streams. We show that an audio-visual system should account for non-frontal poses and normalization techniques through the weight assigned to the visual stream in the classifier.
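The regression-based pose normalization described above can be illustrated with a minimal sketch. It assumes paired training vectors of the same utterances seen from a non-frontal and a frontal pose, and it fits a ridge-regularized least-squares mapping between them. The function names, the ridge term, and the synthetic data are illustrative assumptions, not the configuration used in the article.

```python
import numpy as np

def fit_pose_mapping(X_side, X_front, ridge=1e-3):
    """Fit W, b so that X_front is approximately X_side @ W + b.

    X_side, X_front: (n_samples, n_dims) paired vectors (pixels or features).
    ridge: small regularizer (an assumption of this sketch) for stability.
    """
    mu_s = X_side.mean(axis=0)
    mu_f = X_front.mean(axis=0)
    Xs = X_side - mu_s
    Xf = X_front - mu_f
    d = Xs.shape[1]
    # Regularized normal equations: (Xs^T Xs + ridge * I) W = Xs^T Xf
    W = np.linalg.solve(Xs.T @ Xs + ridge * np.eye(d), Xs.T @ Xf)
    b = mu_f - mu_s @ W
    return W, b

def to_virtual_frontal(x_side, W, b):
    """Map non-frontal vectors to their estimated ("virtual") frontal view."""
    return x_side @ W + b

# Synthetic stand-in for paired data: the side view is modeled as a mild
# linear distortion of the frontal view plus noise (hypothetical data).
rng = np.random.default_rng(0)
X_front = rng.normal(size=(200, 6))
A = np.eye(6) + 0.1 * rng.normal(size=(6, 6))
X_side = X_front @ A + 0.01 * rng.normal(size=(200, 6))

W, b = fit_pose_mapping(X_side, X_front)
X_virtual = to_virtual_frontal(X_side, W, b)

# Mean squared mismatch to the true frontal data, before and after mapping.
baseline = float(np.mean((X_side - X_front) ** 2))
err = float(np.mean((X_virtual - X_front) ** 2))
```

In a real system the same fit could be applied to raw mouth-region images or to extracted visual features, which corresponds to placing the normalization block at different stages of the pipeline.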
