Integration of deep bottleneck features for audio-visual speech recognition

Recent interest in “deep learning”, broadly defined as the use of models composed of multiple non-linear transformations to learn high-level abstractions in data, has led to a growing number of studies applying deep learning to automatic speech recognition (ASR). Several of these studies have found that bottleneck features extracted from deep neural networks (DNNs), sometimes called “deep bottleneck features” (DBNFs), can reduce the word error rates of ASR systems. However, there has been little research on audio-visual speech recognition (AVSR) systems that use DBNFs. In this paper, we propose a method of integrating audio and visual DBNFs using multi-stream HMMs in order to improve AVSR performance under both clean and noisy conditions. We evaluate our method on a continuously spoken Japanese digit recognition task under matched and mismatched conditions. Relative word error rate reductions of roughly 68.7%, 47.4%, and 51.9% were achieved compared with an audio-only ASR system and two feature-fusion models that employed DBNFs with single-stream HMMs, respectively.
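To make the integration step concrete, the sketch below shows the standard stream-weighted log-likelihood combination used in multi-stream HMMs, where the audio and visual emission log-likelihoods for each state are mixed with exponent weights that sum to one. This is a minimal illustration of the general technique, not the authors' exact implementation; the function name, the example values, and the default weight are hypothetical.

```python
import numpy as np

def multistream_log_likelihood(log_b_audio, log_b_visual, lam_audio=0.7):
    """Combine per-state emission log-likelihoods from the audio and
    visual streams with stream (exponent) weights, as in a standard
    multi-stream HMM:

        log b_j(o_a, o_v) = lam_a * log b_j^a(o_a) + lam_v * log b_j^v(o_v)

    with lam_a + lam_v = 1. Under noisy audio, lam_audio would
    typically be lowered so the visual stream contributes more.
    """
    lam_visual = 1.0 - lam_audio
    return lam_audio * log_b_audio + lam_visual * log_b_visual

# Hypothetical per-state log-likelihoods for one frame (3 HMM states).
log_b_a = np.array([-12.3, -10.8, -15.1])  # audio-DBNF stream
log_b_v = np.array([-9.7, -11.2, -10.4])   # visual-DBNF stream
print(multistream_log_likelihood(log_b_a, log_b_v, lam_audio=0.6))
```

In contrast, a single-stream feature-fusion model would concatenate the audio and visual DBNF vectors before decoding, which offers no such per-stream weighting under mismatched noise conditions.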
