Special Section on Recent Advances in Machine Learning for Spoken Language Processing

Investigation of DNN-Based Audio-Visual Speech Recognition

Audio-Visual Speech Recognition (AVSR) is one technique for enhancing the robustness of speech recognizers in noisy or real-world environments. Meanwhile, Deep Neural Networks (DNNs) have recently attracted considerable attention from researchers in the speech recognition field, because they can drastically improve recognition performance. There are two ways to employ DNN techniques for speech recognition: a hybrid approach, in which a DNN computes the emission probability of each Hidden Markov Model (HMM) state, and a tandem approach, in which a DNN is incorporated into the feature extraction scheme. In this paper, we investigate and compare several DNN-based AVSR methods, mainly to clarify how the audio and visual modalities should be integrated using DNNs. We carried out recognition experiments on the CENSREC-1-AV corpus, and we discuss the results to identify the best DNN-based AVSR modeling. It turns out that a tandem-based method that combines audio Deep Bottleneck Features (DBNFs) and visual DBNFs using multi-stream HMMs is the most suitable, followed by a hybrid approach and another tandem scheme using joint audio-visual DBNFs.

key words: audio-visual speech recognition, deep neural network, Deep Bottleneck Feature, multi-stream HMM
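To make the tandem scheme concrete, the following is a minimal sketch, not the authors' implementation, assuming PyTorch and NumPy. It illustrates the two ingredients of the best-performing configuration described above: a DNN whose narrow middle layer yields Deep Bottleneck Features, and stream-weighted log-likelihood fusion as used in multi-stream HMMs. The layer sizes, number of HMM-state targets, and stream weight are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' implementation): a DNN whose narrow
# middle layer yields Deep Bottleneck Features (DBNFs), plus the
# stream-weighted log-likelihood fusion used in multi-stream HMMs.
# All sizes and weights below are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn


class BottleneckDNN(nn.Module):
    """MLP trained to classify HMM states; its narrow middle layer
    provides compact tandem features (DBNFs)."""

    def __init__(self, in_dim=39, hidden=512, bottleneck=40, n_states=120):
        super().__init__()
        self.front = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.Sigmoid(),
            nn.Linear(hidden, hidden), nn.Sigmoid(),
            nn.Linear(hidden, bottleneck),        # the bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Sigmoid(),
            nn.Linear(bottleneck, hidden), nn.Sigmoid(),
            nn.Linear(hidden, n_states),          # HMM-state targets
        )

    def forward(self, x):
        # Full network, used only while training the state classifier.
        return self.back(self.front(x))

    def extract_dbnf(self, x):
        # After training, the layers past the bottleneck are discarded
        # and the bottleneck activations serve as tandem features.
        with torch.no_grad():
            return self.front(x)


def multistream_loglik(logp_audio, logp_visual, w_audio=0.7):
    """Multi-stream HMM fusion: per-state log-likelihoods of the audio
    and visual streams are combined with exponential stream weights
    (w_audio + w_visual = 1)."""
    return w_audio * logp_audio + (1.0 - w_audio) * logp_visual


# Toy usage: 100 frames of 39-dim audio features (e.g. MFCC + deltas).
dnn = BottleneckDNN()
audio_frames = torch.randn(100, 39)
dbnf = dnn.extract_dbnf(audio_frames)             # shape (100, 40)
fused = multistream_loglik(np.full(120, -5.0), np.full(120, -7.0))
```

In the stream-weighted configuration, two such extractors, one per modality, would be trained separately and their HMM streams combined at decoding time; the single fixed weight used here is purely illustrative.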

[1] B. Atal, Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification, The Journal of the Acoustical Society of America, 1974.

[2] S. Boll, et al., Suppression of acoustic noise in speech using spectral subtraction, 1979.

[3] B.D. Van Veen, et al., Beamforming: a versatile approach to spatial filtering, IEEE ASSP Magazine, 1988.

[4] Y. Konig, et al., "Eigenlips" for robust speech recognition, Proc. ICASSP, 1994.

[5] C.-H. Lee, et al., Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains, IEEE Trans. Speech Audio Process., 1994.

[6] P.C. Woodland, et al., Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models, Comput. Speech Lang., 1995.

[7] K. Tokuda, et al., Audio-visual speech recognition using MCE-based HMMs and model-dependent stream weights, Proc. INTERSPEECH, 2000.

[8] C. Neti, et al., Stream confidence estimation for audio-visual speech recognition, Proc. INTERSPEECH, 2000.

[9] K. Iwano, Bimodal speech recognition using lip movement measured by optical flow analysis, 2001.

[10] Y. Bengio, et al., Greedy Layer-Wise Training of Deep Networks, Proc. NIPS, 2006.

[11] S. Tamura, et al., Voice activity detection based on fusion of audio and visual information, Proc. AVSP, 2009.

[12] B.-J. Theobald, et al., Comparing visual features for lipreading, Proc. AVSP, 2009.

[13] S. Nakamura, et al., CENSREC-1-AV: an audio-visual corpus for noisy bimodal speech recognition, Proc. AVSP, 2010.

[14] T. Takiguchi, et al., Multimodal speech recognition of a person with articulation disorders using AAM and MAF, Proc. IEEE International Workshop on Multimedia Signal Processing, 2010.

[15] N. Hagita, et al., Real-time audio-visual voice activity detection for speech recognition in noisy environments, Proc. AVSP, 2010.

[16] D. Yu, et al., Improved Bottleneck Features Using Pretrained Deep Neural Networks, Proc. INTERSPEECH, 2011.

[17] S. Hayamizu, et al., Audio-visual Interaction in Model Adaptation for Multi-modal Speech Recognition, 2011.

[18] J. Nam, et al., Multimodal Deep Learning, Proc. ICML, 2011.

[19] S. Tamura, et al., GIF-SP: GA-based informative feature for noisy speech recognition, Proc. APSIPA ASC, 2012.

[20] G.E. Hinton, et al., Acoustic Modeling Using Deep Belief Networks, IEEE Transactions on Audio, Speech, and Language Processing, 2012.

[21] S. Tamura, et al., GIF-LR: GA-based informative feature for lipreading, Proc. APSIPA ASC, 2012.

[22] J. Huang, et al., Audio-visual deep learning for noise robust speech recognition, Proc. IEEE ICASSP, 2013.

[23] S. Tamura, et al., Data collection for mobile audio-visual speech recognition in various environments, Proc. Oriental COCOSDA, 2014.

[24] T. Ogata, et al., Audio-visual speech recognition using deep learning, Applied Intelligence, 2014.

[25] D. Burnham, Keynote 1: Big Data and Resource Sharing: A speech corpus and a Virtual Laboratory for facilitating human communication science research, Proc. Oriental COCOSDA, 2014.

[26] S. Tamura, et al., Audio-visual speech recognition using deep bottleneck features and high-performance lipreading, Proc. APSIPA ASC, 2015.

[27] S. Tamura, et al., Integration of deep bottleneck features for audio-visual speech recognition, Proc. INTERSPEECH, 2015.

[28] V. Goel, et al., Detecting audio-visual synchrony using deep neural networks, Proc. INTERSPEECH, 2015.

[29] N. Harte, et al., TCD-TIMIT: An Audio-Visual Corpus of Continuous Speech, IEEE Transactions on Multimedia, 2015.