Multi-Stream Asynchrony Dynamic Bayesian Network Model for Audio-Visual Continuous Speech Recognition

How best to describe the asynchrony between speech and lip motion is a key problem in audio-visual speech recognition. This paper proposes a multi-stream asynchrony dynamic Bayesian network (MS-ADBN) model for audio-visual speech recognition. In this model, the audio and visual streams are synchronized at the word nodes; between word nodes, each stream has its own independent phone, phone transition, and observation vector nodes, and the word transition probability is determined jointly by the audio and visual streams. For each stream, each word is composed of its corresponding phones, and each phone is associated with an observation feature (an audio feature for the audio stream, a visual feature for the visual stream) with a probability modeled by a Gaussian mixture model. Compared with a general multi-stream HMM, the MS-ADBN model describes the asynchrony of the audio and visual streams up to the word level. Experimental results on a continuous-digit audio-visual database show that, compared with the multi-stream HMM, the MS-ADBN model obtains an average improvement of 10.07% in mismatched noise environments.
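The word-level synchronization constraint described above can be illustrated with a small sketch. This is not the authors' implementation. It assumes a deterministic rule in which a word transition fires only when both streams are in the last phone of the word and both signal a phone exit in the same frame, whereas the abstract states only that the word transition probability depends on both streams. The names `StreamState`, `word_transition`, and `step` are hypothetical, and for simplicity the next word reuses the same phone counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamState:
    """Position of one stream (audio or visual) inside the current word."""
    phone_pos: int  # index of the current phone within the word
    n_phones: int   # number of phones this stream uses for the word

    def at_last_phone(self) -> bool:
        return self.phone_pos == self.n_phones - 1


def word_transition(audio, visual, audio_exit, visual_exit):
    """Assumed rule: the word ends only if BOTH streams leave their last phone."""
    return (audio.at_last_phone() and audio_exit
            and visual.at_last_phone() and visual_exit)


def _advance(state, exit_flag):
    """Move to the next phone inside the word; wait at the last phone."""
    if exit_flag and not state.at_last_phone():
        return StreamState(state.phone_pos + 1, state.n_phones)
    return state


def step(audio, visual, audio_exit, visual_exit):
    """Advance one frame and return (audio, visual, word_ended)."""
    if word_transition(audio, visual, audio_exit, visual_exit):
        # Both streams are re-synchronized at the start of the next word
        # (same phone counts reused here, which is a simplification).
        return (StreamState(0, audio.n_phones),
                StreamState(0, visual.n_phones),
                True)
    # Inside a word, each stream moves through its phones independently.
    return _advance(audio, audio_exit), _advance(visual, visual_exit), False


if __name__ == "__main__":
    a, v = StreamState(0, 3), StreamState(0, 2)
    for a_exit, v_exit in [(True, False), (True, True), (True, True)]:
        a, v, ended = step(a, v, a_exit, v_exit)
        print(a.phone_pos, v.phone_pos, ended)
```

Within a word the two phone indices may drift apart, which is the asynchrony the model permits; the joint condition in `word_transition` is what forces the streams back into alignment at each word boundary.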
