Modality Attention for End-to-end Audio-visual Speech Recognition

Audio-visual speech recognition (AVSR) is considered one of the most promising approaches to robust speech recognition, especially in noisy environments. In this paper, we propose a novel multimodal attention-based method for audio-visual speech recognition that automatically learns a fused representation from both modalities according to their importance. Our method is realized using state-of-the-art sequence-to-sequence (Seq2seq) architectures. Experimental results show relative improvements of 2% to 36% over the audio modality alone, depending on the signal-to-noise ratio (SNR). Compared to traditional feature-concatenation methods, our proposed approach achieves better recognition performance under both clean and noisy conditions. We believe the modality attention based end-to-end method can be readily generalized to other multimodal tasks with correlated information.
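To make the fusion idea concrete, here is a minimal sketch of modality attention in PyTorch. This is not the authors' implementation; the class name, scoring network, dimensions, and the assumption that audio and visual features are frame-aligned are all illustrative. The key point it shows is a learned, per-time-step softmax over the two modalities, so the fused vector can lean on whichever stream is more reliable at that frame.

```python
import torch
import torch.nn as nn

class ModalityAttention(nn.Module):
    """Fuse per-frame audio and visual features with learned
    per-modality attention weights (illustrative sketch, not the
    paper's exact architecture)."""

    def __init__(self, dim: int):
        super().__init__()
        # Scores each modality's frame vector with a scalar;
        # a deeper scoring MLP is an equally plausible choice.
        self.score = nn.Linear(dim, 1)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # audio, visual: (batch, time, dim), assumed frame-aligned.
        feats = torch.stack([audio, visual], dim=2)         # (B, T, 2, D)
        weights = torch.softmax(self.score(feats), dim=2)   # (B, T, 2, 1)
        return (weights * feats).sum(dim=2)                 # (B, T, D)

# Example: fuse 10 frames of 256-dim audio and visual encodings.
fusion = ModalityAttention(dim=256)
fused = fusion(torch.randn(4, 10, 256), torch.randn(4, 10, 256))
print(fused.shape)  # torch.Size([4, 10, 256])
```

Unlike feature concatenation, which fixes the relative contribution of each modality in the downstream weights, the softmax over modalities re-weights them at every frame, which is what allows the model to down-weight a degraded audio stream as the SNR drops.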
