ARoBERT: An ASR Robust Pre-Trained Language Model for Spoken Language Understanding

Spoken Language Understanding (SLU) aims to interpret the meaning of human speech in order to support various human-machine interaction systems. A key upstream technique for SLU is Automatic Speech Recognition (ASR), which transcribes speech signals into text. Because the output of modern ASR systems unavoidably contains errors, mainstream SLU models trained or tested on ASR transcripts are not sufficiently robust to these errors. We present ARoBERT, an ASR-Robust BERT model that can be fine-tuned to solve a variety of SLU tasks with noisy inputs. To guarantee the robustness of ARoBERT, during pre-training we reduce the fluctuation of language representations when parts of the input text are replaced by homophones or synophones. Specifically, we propose two novel self-supervised pre-training tasks for ARoBERT, namely Phonetically-aware Masked Language Modeling (PMLM) and ASR Model-adaptive Masked Language Modeling (AMMLM). The PMLM task explicitly fuses knowledge of word phonetic similarities into the pre-training process, forcing homophones and synophones to share similar representations. In AMMLM, a data-driven algorithm is further introduced to mine typical ASR errors so that ARoBERT can tolerate ASR model errors. In the experiments, we evaluate ARoBERT on multiple datasets. The results show the superiority of ARoBERT, which consistently outperforms strong baselines. We also show that ARoBERT outperforms state-of-the-art methods on a public benchmark. ARoBERT has been deployed in an online production system with significant improvements.
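To make the PMLM idea concrete, the following is a minimal sketch of phonetically-aware corruption for masked language modeling. It is not the paper's actual implementation: the `HOMOPHONES` dictionary, function name, and probabilities are illustrative assumptions standing in for the phonetic-similarity resources the pre-training pipeline would derive from a pronunciation lexicon. The key point it shows is that a selected position can be replaced by a phonetically similar word instead of the generic `[MASK]` token, so the model is trained to recover the original word from homophone-corrupted context.

```python
import random

# Hypothetical homophone/synophone dictionary; a real system would derive
# phonetic similarity from a pronunciation lexicon rather than a hand list.
HOMOPHONES = {
    "flour": ["flower"],
    "their": ["there"],
    "right": ["write", "rite"],
}

def pmlm_corrupt(tokens, mask_prob=0.15, homophone_prob=0.5, seed=0):
    """Corrupt a token sequence for phonetically-aware masked LM training.

    Each selected position is either replaced by a homophone (so the model
    must map phonetically similar words to similar representations) or by
    the generic [MASK] token, as in standard BERT pre-training. Returns the
    corrupted sequence and per-position targets (None = not in the loss).
    """
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            targets.append(tok)  # the model must recover the original token
            if tok in HOMOPHONES and rng.random() < homophone_prob:
                corrupted.append(rng.choice(HOMOPHONES[tok]))
            else:
                corrupted.append("[MASK]")
        else:
            targets.append(None)  # unselected position, excluded from loss
            corrupted.append(tok)
    return corrupted, targets
```

In an actual pre-training setup, the corrupted sequence would be fed to the encoder and the cross-entropy loss computed only at positions with non-`None` targets.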
