Fusing audio and visual features of speech

In this paper, the audio and visual features of speech are integrated using a novel fused-HMM. We assume that the two sets of features may have different data rates and durations. Hidden Markov models (HMMs) are first used to model them separately, and a general Bayesian fusion method, which is optimal in the maximum entropy sense, is then employed to fuse them. In particular, an efficient learning algorithm is introduced: instead of maximizing the joint likelihood of the fused-HMM, it trains the two HMMs separately by maximizing their individual likelihoods and then fuses them together. In addition, an inference algorithm is proposed. We have tested the proposed method in person verification experiments. The results show that it significantly reduces recognition error rates compared with unimodal HMMs and a loosely coupled fusion model.
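To make the train-separately-then-fuse idea concrete, here is a minimal sketch in Python using the hmmlearn library. It is not the paper's method: the fusion step below is a simple weighted per-frame log-likelihood combination, standing in for the paper's maximum-entropy-optimal Bayesian fusion, and the function names, state count, and the weight `alpha` are all illustrative assumptions.

```python
# Sketch only: separate HMM training per modality, then a simplified
# late fusion of their scores. The paper's actual fused-HMM couples the
# two models via a Bayesian fusion that is optimal in the maximum
# entropy sense; that step is replaced here by a weighted combination.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def train_unimodal_hmms(audio_feats, visual_feats, n_states=5):
    """Train the audio and visual HMMs independently, mirroring the
    learning algorithm's first stage (maximize each HMM separately).
    audio_feats: (n_audio_frames, n_audio_dims) array
    visual_feats: (n_video_frames, n_visual_dims) array
    The two streams may have different frame rates and lengths."""
    hmm_a = GaussianHMM(n_components=n_states, covariance_type="diag")
    hmm_v = GaussianHMM(n_components=n_states, covariance_type="diag")
    hmm_a.fit(audio_feats)
    hmm_v.fit(visual_feats)
    return hmm_a, hmm_v

def fused_score(hmm_a, hmm_v, audio_feats, visual_feats, alpha=0.7):
    """Fused verification score. `alpha` is a hypothetical stream
    weight; dividing by sequence length normalizes for the streams'
    different data rates and durations."""
    ll_a = hmm_a.score(audio_feats) / len(audio_feats)
    ll_v = hmm_v.score(visual_feats) / len(visual_feats)
    return alpha * ll_a + (1.0 - alpha) * ll_v

# Verification: accept a claimed identity if the fused score of the
# test utterance exceeds a threshold tuned on held-out data.
```

In this simplified setting, the design choice to normalize each stream's log-likelihood by its own frame count is what lets the two modalities be combined despite differing rates; the paper's fused-HMM instead models the dependency between the streams directly.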
