An Examination of Audio-Visual Fused HMMs for Speaker Recognition

Fused hidden Markov models (FHMMs) have been shown to work well for audio-visual speaker recognition, but only in an output decision-fusion configuration combining both the audio- and video-biased versions of the FHMM structure. This paper examines the performance of the audio- and video-biased versions independently, and shows that the audio-biased version is considerably more capable for speaker recognition. It further shows that, by exploiting the temporal relationship between the acoustic and visual data, the audio-biased FHMM delivers better performance at lower processing cost than the best-performing output decision fusion of regular HMMs.
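For readers unfamiliar with the baseline, output decision fusion combines the per-speaker scores of independently trained audio and visual models after classification. A minimal sketch of weighted log-likelihood fusion follows; the weight and the scores are illustrative assumptions, not values from the paper:

```python
import numpy as np

def identify_speaker(audio_logliks: np.ndarray,
                     video_logliks: np.ndarray,
                     alpha: float = 0.7) -> int:
    """Output decision fusion: weight the per-speaker log-likelihoods
    from separately trained audio and visual HMMs, then pick the
    speaker with the highest fused score. alpha biases the fusion
    toward the (typically more reliable) audio stream; 0.7 is an
    illustrative choice, not a value from the paper."""
    fused = alpha * audio_logliks + (1.0 - alpha) * video_logliks
    return int(np.argmax(fused))

# Hypothetical per-speaker log-likelihoods for one test utterance.
audio = np.array([-1250.0, -1300.0, -1275.0])
video = np.array([-890.0, -870.0, -905.0])
print(identify_speaker(audio, video))  # prints 0 for this made-up data
```

The FHMM, by contrast, couples the two streams inside a single model rather than fusing scores after the fact, which is what allows it to exploit the temporal dependence between the modalities.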
