Aligning Audiovisual Features for Audiovisual Speech Recognition

Visual information can improve the performance of automatic speech recognition (ASR), especially in the presence of background noise or different speech modes. A key problem is how to fuse the acoustic and visual features, leveraging their complementary information while overcoming the alignment differences between the modalities. Current audiovisual ASR (AV-ASR) systems rely on linear interpolation or extrapolation as a pre-processing step to align audio and visual features, assuming that the feature sequences are aligned frame-by-frame. These pre-processing methods oversimplify the phase difference between lip motion and speech, lacking flexibility and impairing system performance. This paper addresses the fusion of audiovisual features with an alignment neural network (AliNN), built on recurrent neural networks (RNNs) with an attention model. The proposed front-end model automatically learns the alignment from the data. The resulting aligned features are concatenated and fed to conventional back-end ASR systems. The proposed front-end is evaluated under matched and mismatched channel conditions, with clean and noisy recordings. The results show that the proposed approach outperforms the baseline by 24.9% relative with a Gaussian mixture model with hidden Markov model (GMM-HMM) back-end and by 2.4% relative with a deep neural network with hidden Markov model (DNN-HMM) back-end.
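To make the idea concrete, below is a minimal sketch in PyTorch of an attention-based alignment front-end in the spirit of AliNN: recurrent encoders for each modality, dot-product attention that lets every audio frame attend over the visual sequence, and concatenation of the aligned features for a conventional back-end. The layer sizes, feature dimensions, GRU encoders, and the specific attention scoring are illustrative assumptions, not the authors' exact architecture.

    # Minimal sketch (assumed design, not the paper's exact AliNN model).
    import torch
    import torch.nn as nn

    class AlignmentFrontEnd(nn.Module):
        def __init__(self, audio_dim=40, visual_dim=30, hidden_dim=128):
            super().__init__()
            # Recurrent encoders for each modality (sizes are illustrative).
            self.audio_rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
            self.visual_rnn = nn.GRU(visual_dim, hidden_dim, batch_first=True)

        def forward(self, audio, visual):
            # audio:  (batch, T_a, audio_dim)  acoustic frames (e.g., ~100 fps)
            # visual: (batch, T_v, visual_dim) visual frames   (e.g., ~30 fps)
            audio_h, _ = self.audio_rnn(audio)      # (batch, T_a, hidden_dim)
            visual_h, _ = self.visual_rnn(visual)   # (batch, T_v, hidden_dim)

            # Dot-product attention: each audio frame queries all visual frames,
            # so the cross-modal alignment is learned rather than fixed by
            # linear interpolation or extrapolation.
            scores = torch.bmm(audio_h, visual_h.transpose(1, 2))   # (batch, T_a, T_v)
            weights = torch.softmax(scores, dim=-1)
            aligned_visual = torch.bmm(weights, visual_h)            # (batch, T_a, hidden_dim)

            # Concatenate aligned visual features with the audio encoding; the
            # result can be fed to a conventional GMM-HMM or DNN-HMM back-end.
            return torch.cat([audio_h, aligned_visual], dim=-1)

    if __name__ == "__main__":
        frontend = AlignmentFrontEnd()
        audio = torch.randn(2, 100, 40)   # 100 audio frames per utterance
        visual = torch.randn(2, 30, 30)   # 30 video frames at a different rate
        fused = frontend(audio, visual)
        print(fused.shape)                # torch.Size([2, 100, 256])

Because the attention weights are computed per audio frame, the front-end produces one fused feature vector per acoustic frame regardless of the video frame rate, which is what allows it to replace frame-by-frame interpolation as a pre-processing step.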
