Audio-Visual Efficient Conformer for Robust Speech Recognition

End-to-end Automatic Speech Recognition (ASR) systems based on neural networks have seen large improvements in recent years. The availability of large-scale hand-labeled datasets and sufficient computing resources has made it possible to train powerful deep neural networks that reach very low Word Error Rates (WER) on academic benchmarks. However, despite impressive performance on clean audio samples, a drop in performance is often observed on noisy speech. In this work, we propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification (CTC)-based architecture by processing both audio and visual modalities. We improve on previous lip-reading methods by using an Efficient Conformer back-end on top of a ResNet-18 visual front-end and by adding intermediate CTC losses between blocks. We condition intermediate block features on early predictions using Inter CTC residual modules to relax the conditional independence assumption of CTC-based models. We also replace the Efficient Conformer's grouped attention with a more efficient and simpler attention mechanism that we call patch attention. We experiment with the publicly available Lip Reading Sentences 2 (LRS2) and Lip Reading Sentences 3 (LRS3) datasets. Our experiments show that using both audio and visual modalities allows speech to be recognized more accurately in the presence of environmental noise and significantly accelerates training, reaching lower WER with 4 times fewer training steps. Our Audio-Visual Efficient Conformer (AVEC) model achieves state-of-the-art performance, reaching WERs of 2.3% and 1.8% on the LRS2 and LRS3 test sets. Code and pretrained models are available at https://github.com/burchim/AVEC.
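To make the Inter CTC conditioning concrete, here is a minimal PyTorch sketch of an Inter CTC residual module as described above: an intermediate CTC head produces frame-level posteriors that are projected back to the encoder dimension and added to the block features, so later blocks are conditioned on early predictions. Module and variable names here are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

class InterCTCResidualModule(nn.Module):
    """Sketch of an Inter CTC residual module: an intermediate CTC
    prediction is projected back to the encoder dimension and added to
    the block features, conditioning later blocks on early predictions.
    Placement between encoder blocks and naming are assumptions."""

    def __init__(self, dim: int, vocab_size: int):
        super().__init__()
        self.to_logits = nn.Linear(dim, vocab_size)    # intermediate CTC head
        self.to_features = nn.Linear(vocab_size, dim)  # project posteriors back

    def forward(self, x: torch.Tensor):
        # x: (batch, time, dim)
        logits = self.to_logits(x)            # fed to an intermediate CTC loss
        posteriors = logits.softmax(dim=-1)
        x = x + self.to_features(posteriors)  # residual conditioning on predictions
        return x, logits
```

Training would sum the intermediate CTC losses computed on each block's `logits` with the final CTC loss, which is what relaxes the conditional independence assumption across blocks.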
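Similarly, a hedged sketch of the patch attention idea: the time axis is average-pooled into non-overlapping patches, standard multi-head self-attention runs on the shorter sequence, and the output is upsampled back to the original length, reducing attention cost from O(T^2) to O((T/p)^2) for patch size p. The exact pooling and upsampling choices below are assumptions for illustration, not the paper's definitive implementation.

```python
import torch
import torch.nn as nn

class PatchAttention(nn.Module):
    """Sketch of patch attention: pool the sequence into patches, attend
    over the shorter sequence, then upsample back. Assumes the sequence
    length is divisible by patch_size."""

    def __init__(self, dim: int, num_heads: int, patch_size: int):
        super().__init__()
        # Non-overlapping average pooling shortens the time axis by patch_size
        self.pool = nn.AvgPool1d(kernel_size=patch_size, stride=patch_size)
        self.attention = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Nearest-neighbor upsampling restores the original sequence length
        self.upsample = nn.Upsample(scale_factor=patch_size, mode="nearest")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) -> pool over time: (batch, time // patch_size, dim)
        patches = self.pool(x.transpose(1, 2)).transpose(1, 2)
        out, _ = self.attention(patches, patches, patches)
        # Upsample back to (batch, time, dim)
        return self.upsample(out.transpose(1, 2)).transpose(1, 2)
```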
