DBN Based Models for Audio-Visual Speech Analysis and Recognition

We present an audio-visual automatic speech recognition system, which significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system consists of three components: (i) a visual module, (ii) an acoustic module, and (iii) a Dynamic Bayesian Network-based recognition module. The vision module, locates and tracks the speaker head, and mouth movements and extracts relevant speech features represented by contour information and 3D deformations of lip movements. The acoustic module extracts noise-robust features, i.e. the Mel Filterbank Cepstrum Coefficients (MFCCs). Finally we propose two models based on Dynamic Bayesian Networks (DBN) to either consider the single audio and video streams or to integrate the features from the audio and visual streams. We also compare the proposed DBN based system with classical Hidden Markov Model. The novelty of the developed framework is the persistence of the audiovisual speech signal characteristics from the extraction step, through the learning step. Experiments on continuous audiovisual speech show that the segmentation boundaries of phones in the audio stream and visemes in the video stream are close to manual segmentation boundaries.

[1]  Geoffrey Zweig,et al.  The graphical models toolkit: An open source software system for speech and time-series processing , 2002, 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[2]  Hichem Sahli,et al.  Context dependent viseme models for voice driven animation , 2003, Proceedings EC-VIP-MC 2003. 4th EURASIP Conference focused on Video/Image Processing and Multimedia Communications (IEEE Cat. No.03EX667).

[3]  Chalapathy Neti,et al.  Asynchrony modeling for audio-visual speech recognition , 2002 .

[4]  Ilse Ravyse,et al.  Kernel-based head tracker for videophony , 2005, IEEE International Conference on Image Processing 2005.

[5]  Jonas Beskow,et al.  SYNFACE - A Talking Head Telephone for the Hearing-Impaired , 2004, ICCHP.

[6]  Jeff A. Bilmes,et al.  DBN based multi-stream models for audio-visual speech recognition , 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[7]  Peter Eisert,et al.  Very low bit rate video coding using 3-D models , 2001 .

[8]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[9]  Yi Zhou,et al.  Bayesian tangent shape model: estimating shape and pose parameters via Bayesian inference , 2003, 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings..

[10]  J.A. Bilmes,et al.  Graphical model architectures for speech recognition , 2005, IEEE Signal Processing Magazine.

[11]  Takeo Kanade,et al.  Three-dimensional scene flow , 2005, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[12]  Jeff A. Bilmes,et al.  DBN based multi-stream models for speech , 2003, 2003 IEEE International Conference on Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03)..

[13]  Jeff A. Bilmes,et al.  DBN-based multi-stream models for Mandarin toneme recognition , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..