论文信息 - Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR

Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR

Automatic speech recognition (ASR) enables very intuitive human-machine interaction. However, signal degradations due to reverberation or noise reduce the accuracy of audio-based recognition. The introduction of a second signal stream that is not affected by degradations in the audio domain (e.g., a video stream) increases the robustness of ASR against degradations in the original domain. Here, depending on the signal quality of audio and video at each point in time, a dynamic weighting of both streams can optimize the recognition performance. In this work, we introduce a strategy for estimating optimal weights for the audio and video streams in turbo-decodingbased ASR using a discriminative cost function. The results show that turbo decoding with this maximally discriminative dynamic weighting of information yields higher recognition accuracy than turbo-decoding-based recognition with fixed stream weights or optimally dynamically weighted audiovisual decoding using coupled hidden Markov models.

[1] Mohan M. Trivedi,et al. Multimodal information fusion using the iterative decoding algorithm and its application to audio-visual speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[2] Lawrence R. Rabiner,et al. A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[3] Reinhold Haeb-Umbach,et al. Robust Speech Recognition of Uncertain or Missing Data - Theory and Applications , 2011 .

[4] Jon Barker,et al. An audio-visual corpus for speech perception and automatic speech recognition. , 2006, The Journal of the Acoustical Society of America.

[5] Kevin P. Murphy,et al. Dynamic Bayesian Networks for Audio-Visual Speech Recognition , 2002, EURASIP J. Adv. Signal Process..

[6] Satoshi Tamura,et al. Audio-visual speech recognition using deep bottleneck features and high-performance lipreading , 2015, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).

[7] Guy J. Brown,et al. Robust audiovisual speech recognition using noise-adaptive linear discriminant analysis , 2016, 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[8] Lalit R. Bahl,et al. Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[9] Brian Kingsbury,et al. Boosted MMI for model and feature-space discriminative training , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[10] Dorothea Kolossa,et al. Learning Dynamic Stream Weights For Coupled-HMM-Based Audio-Visual Speech Recognition , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11] Dorothea Kolossa,et al. A newem estimationof dynamic stream weights for coupled-HMM-based audio-visual ASR , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[12] Ramón Fernández Astudillo,et al. Use of Missing and Unreliable Data for Audiovisual Speech Recognition , 2011, Robust Speech Recognition of Uncertain or Missing Data.

[13] Xiao Li,et al. Machine Learning Paradigms for Speech Recognition: An Overview , 2013, IEEE Transactions on Audio, Speech, and Language Processing.

[14] Alain Glavieux,et al. Reflections on the Prize Paper : "Near optimum error-correcting coding and decoding: turbo codes" , 1998 .

[15] Chalapathy Neti,et al. Recent advances in the automatic recognition of audiovisual speech , 2003, Proc. IEEE.

[16] Juergen Luettin,et al. Audio-Visual Speech Modeling for Continuous Speech Recognition , 2000, IEEE Trans. Multim..

[17] A. Glavieux,et al. Near Shannon limit error-correcting coding and decoding: Turbo-codes. 1 , 1993, Proceedings of ICC '93 - IEEE International Conference on Communications.

[18] Tim Fingscheidt,et al. A Turbo-Decoding Weighted Forward-Backward Algorithm for Multimodal Speech Recognition , 2016 .

[19] Tim Fingscheidt,et al. Turbo Automatic Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[20] Juhan Nam,et al. Multimodal Deep Learning , 2011, ICML.

[21] Vaibhava Goel,et al. Deep multimodal learning for Audio-Visual Speech Recognition , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).