WaveNet With Cross-Attention for Audiovisual Speech Recognition

In this paper, WaveNet with cross-attention is proposed for Audio-Visual Automatic Speech Recognition (AV-ASR) to address the problems of multimodal feature fusion and frame alignment between the two data streams. WaveNet is usually used for speech generation and speech recognition; here, we extend it to audiovisual speech recognition by introducing a cross-attention mechanism at different places in WaveNet for feature fusion. The proposed cross-attention mechanism identifies, for each acoustic feature frame, the correlated frames of the visual feature stream. Experimental results show that WaveNet with cross-attention reduces the Tibetan single-syllable error rate by about 4.5% and the English word error rate by about 39.8% relative to audio-only speech recognition, and reduces the Tibetan single-syllable error rate by about 35.1% and the English word error rate by about 21.6% relative to the conventional feature-concatenation method for AV-ASR.
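The abstract does not give the exact attention formulation, so the following is a minimal sketch in PyTorch of the general idea, assuming standard scaled dot-product cross-attention in which acoustic frames act as queries over visual key/value frames; the module, parameter names, and dimensions are hypothetical, not the authors' implementation.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Fuses visual features into the acoustic stream: each acoustic
    frame attends over all visual frames, so the two streams need not
    be frame-aligned (T_a and T_v may differ)."""

    def __init__(self, audio_dim: int, visual_dim: int, attn_dim: int):
        super().__init__()
        self.q_proj = nn.Linear(audio_dim, attn_dim)   # queries from audio
        self.k_proj = nn.Linear(visual_dim, attn_dim)  # keys from video
        self.v_proj = nn.Linear(visual_dim, attn_dim)  # values from video
        self.scale = attn_dim ** -0.5

    def forward(self, audio_feats: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats:  (batch, T_a, audio_dim)
        # visual_feats: (batch, T_v, visual_dim)
        q = self.q_proj(audio_feats)
        k = self.k_proj(visual_feats)
        v = self.v_proj(visual_feats)
        # attention weights over visual frames for each acoustic frame
        attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        context = attn @ v  # (batch, T_a, attn_dim)
        # concatenate the attended visual context onto each acoustic frame
        return torch.cat([audio_feats, context], dim=-1)

# Example: 200 acoustic frames attend over 50 video frames.
fusion = CrossAttentionFusion(audio_dim=80, visual_dim=256, attn_dim=128)
fused = fusion(torch.randn(2, 200, 80), torch.randn(2, 50, 256))  # (2, 200, 208)

In the paper's setting, such a fused representation would feed the dilated convolution stack of WaveNet; because attention pools the visual frames per acoustic frame, it sidesteps the frame-rate mismatch that plain feature concatenation must handle by resampling.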
