Dynamic visual features for audio-visual speaker verification

The cascading appearance-based (CAB) feature extraction technique has established itself as the state of the art in extracting dynamic visual speech features for speech recognition. In this paper, we investigate the effectiveness of this technique for the related task of speaker verification. By evaluating the speaker verification performance of each stage of the cascade, we demonstrate that the same steps taken to reduce static speaker and environmental information for visual speech recognition also provide similar improvements for visual speaker recognition. A further study compares synchronous HMM (SHMM) based fusion of CAB visual features and traditional perceptual linear predictive (PLP) acoustic features with simpler utterance-level score fusion, and shows that the higher complexity inherent in the SHMM approach does not appear to provide any improvement in the final audio-visual speaker verification system.
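To make the simpler baseline concrete, the following is a minimal sketch of utterance-level score fusion. It does not reproduce the paper's implementation: the abstract does not state the fusion rule or the normalisation used, so the weighted sum, the z-score normalisation, the default weight, the threshold, and all function names here are illustrative assumptions.

```python
from statistics import mean, stdev


def z_normalise(score, reference_scores):
    """Shift and scale a raw score using the mean and sample standard
    deviation of reference scores (assumed to come from a development set)."""
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)
    return (score - mu) / sigma


def fuse_scores(audio_score, visual_score, audio_weight=0.5):
    """Weighted-sum fusion of two already-normalised utterance scores.

    audio_weight is a hypothetical tuning parameter in [0, 1]; the visual
    stream receives the remaining weight.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_score + (1.0 - audio_weight) * visual_score


def verify(audio_score, visual_score, threshold=0.0, audio_weight=0.5):
    """Accept the identity claim when the fused score reaches the threshold."""
    return fuse_scores(audio_score, visual_score, audio_weight) >= threshold
```

In this sketch each modality is scored independently over the whole utterance and the two scores are combined only once, at the end. SHMM fusion, by contrast, models both feature streams jointly at the state level, which is the source of its additional complexity.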
