A new approach to integrate audio and visual features of speech

This paper presents a novel fused hidden Markov model (fused-HMM) for integrating the audio and visual features of speech. In this model, individually trained audio and visual HMMs are fused by a general probabilistic fusion method that is optimal in the maximum-entropy sense. Specifically, the fusion method uses the dependencies between the audio hidden states and the visual observations to infer the overall dependency between the two modalities. The learning and inference algorithms described in this paper can handle audio and video features with different data rates and durations. In speaker verification experiments, the proposed method significantly reduces the recognition error rate compared with unimodal HMMs and simpler fusion methods.
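The fusion idea in the abstract, coupling the audio HMM's hidden states with the visual observation stream, can be illustrated with a short sketch. The code below is a minimal, hypothetical reading of that idea rather than the paper's actual algorithm: it trains two unimodal Gaussian HMMs (using the hmmlearn toolkit as an assumed dependency), aligns the lower-rate video frames to the Viterbi-decoded audio state sequence with a simple nearest-neighbour time mapping, fits one diagonal Gaussian of video features per audio state as the cross-modal dependency term, and adds that term to the two unimodal log-likelihoods when scoring a test utterance. All function names, the alignment scheme, and the Gaussian coupling model are illustrative assumptions.

```python
import numpy as np
from hmmlearn import hmm                    # assumed HMM toolkit; any equivalent works
from scipy.stats import multivariate_normal


def train_unimodal_hmms(audio_feats, video_feats, n_audio_states=5, n_video_states=3):
    """Train the two single-modality HMMs separately, as the paper describes.
    audio_feats: (T_audio, d_audio) array; video_feats: (T_video, d_video) array."""
    audio_hmm = hmm.GaussianHMM(n_components=n_audio_states, covariance_type="diag", n_iter=50)
    audio_hmm.fit(audio_feats)
    video_hmm = hmm.GaussianHMM(n_components=n_video_states, covariance_type="diag", n_iter=50)
    video_hmm.fit(video_feats)
    return audio_hmm, video_hmm


def _align_video_to_audio(T_audio, T_video):
    """Map each video frame to the concurrent audio frame; the two streams have
    different rates and lengths, so use a nearest-neighbour time alignment."""
    return np.minimum(np.arange(T_video) * T_audio // T_video, T_audio - 1)


def fit_coupling(audio_hmm, audio_feats, video_feats):
    """Model the cross-modal dependency: one diagonal Gaussian over video
    features for each audio hidden state, estimated from the Viterbi-decoded
    audio state active under each video frame."""
    audio_states = audio_hmm.predict(audio_feats)                 # Viterbi state sequence
    idx = _align_video_to_audio(len(audio_feats), len(video_feats))
    states_under_video = audio_states[idx]
    coupling = {}
    for s in np.unique(states_under_video):
        v = video_feats[states_under_video == s]
        mean = v.mean(axis=0)
        var = v.var(axis=0) + 1e-3                                # variance floor
        coupling[int(s)] = (mean, var)
    return coupling


def fused_score(audio_hmm, video_hmm, coupling, audio_feats, video_feats):
    """One plausible fused log-likelihood: the two unimodal scores plus the
    audio-state / video-observation dependency term."""
    log_p_audio = audio_hmm.score(audio_feats)
    log_p_video = video_hmm.score(video_feats)
    audio_states = audio_hmm.predict(audio_feats)
    idx = _align_video_to_audio(len(audio_feats), len(video_feats))
    log_dep = 0.0
    for t, s in zip(range(len(video_feats)), audio_states[idx]):
        mean, var = coupling.get(int(s), (None, None))
        if mean is not None:
            log_dep += multivariate_normal.logpdf(video_feats[t], mean=mean, cov=np.diag(var))
    return log_p_audio + log_p_video + log_dep
```

In a verification setting of the kind the abstract reports, one such fused model would be trained per speaker, and a test utterance accepted when its fused score under the claimed speaker's model exceeds a threshold; this usage is an assumption consistent with, but not spelled out by, the abstract.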
