End-to-End Spoken Language Understanding Using Transformer Networks and Self-Supervised Pre-Trained Features

Transformer networks and self-supervised pre-training have consistently delivered state-of-the-art results in natural language processing (NLP); however, their merits in spoken language understanding (SLU) still require further investigation. In this paper, we introduce a modular end-to-end (E2E) SLU architecture based on transformer networks that supports self-supervised pre-trained acoustic features, pre-trained model initialization, and multi-task training. We perform several SLU experiments on the ATIS dataset, predicting intent and entity labels/values, and investigate how pre-trained model initialization and multi-task training interact with either traditional filterbank features or self-supervised pre-trained acoustic features. Results show not only that self-supervised pre-trained acoustic features outperform filterbank features in almost all experiments, but also that, when combined with multi-task training, these features nearly eliminate the need for pre-trained model initialization.
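To make the described setup concrete, below is a minimal PyTorch sketch of a multi-task E2E SLU model along the lines the abstract outlines: a shared transformer encoder over acoustic features (traditional filterbanks or self-supervised wav2vec-style vectors) with one head for utterance-level intent classification and one for per-frame entity tagging. The class name MultiTaskSLU, the label counts, and all hyperparameters here are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch of a multi-task E2E SLU model: a shared transformer
# encoder with an intent head and an entity-tagging head. All names and
# hyperparameters are illustrative, not the authors' exact setup.
import torch
import torch.nn as nn

class MultiTaskSLU(nn.Module):
    def __init__(self, feat_dim=80, d_model=256, n_heads=4, n_layers=6,
                 n_intents=26, n_entity_tags=120):
        super().__init__()
        # Project acoustic features (e.g., 80-dim filterbanks or
        # pre-trained wav2vec representations) into the model dimension.
        self.input_proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=1024,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Task-specific heads share the encoder (the multi-task setting).
        self.intent_head = nn.Linear(d_model, n_intents)      # one label per utterance
        self.entity_head = nn.Linear(d_model, n_entity_tags)  # one tag per frame

    def forward(self, feats, pad_mask=None):
        h = self.encoder(self.input_proj(feats),
                         src_key_padding_mask=pad_mask)
        intent_logits = self.intent_head(h.mean(dim=1))  # pooled utterance vector
        entity_logits = self.entity_head(h)              # per-frame tag scores
        return intent_logits, entity_logits

# Joint training objective: sum of the intent and entity losses,
# shown here on random stand-in data.
model = MultiTaskSLU()
feats = torch.randn(2, 300, 80)                # (batch, frames, feat_dim)
intent_logits, entity_logits = model(feats)
intent_loss = nn.functional.cross_entropy(intent_logits,
                                          torch.tensor([3, 7]))
entity_loss = nn.functional.cross_entropy(entity_logits.transpose(1, 2),
                                          torch.randint(0, 120, (2, 300)))
loss = intent_loss + entity_loss
loss.backward()
```

Under this sketch, switching between filterbank and self-supervised pre-trained features only changes feat_dim and the upstream feature extractor; the shared encoder and the two task heads are unaffected, which is what makes the feature comparison in the experiments modular.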
