ST-BERT: Cross-Modal Language Model Pre-Training for End-to-End Spoken Language Understanding

Language model pre-training has shown promising results on various downstream tasks. In this context, we introduce a cross-modal pre-trained language model, called Speech-Text BERT (ST-BERT), to tackle end-to-end spoken language understanding (E2E SLU) tasks. Taking phoneme posteriors and subword-level text as input, ST-BERT learns a contextualized cross-modal alignment via our two proposed pre-training tasks: Cross-modal Masked Language Modeling (CM-MLM) and Cross-modal Conditioned Language Modeling (CM-CLM). Experimental results on three benchmarks show that our approach is effective across various SLU datasets and exhibits surprisingly little performance degradation even when only 1% of the training data is available. Moreover, our method achieves further SLU performance gains via domain-adaptive pre-training on domain-specific speech-text pair data.
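
Since the abstract describes the pre-training setup only at a high level, the following is a minimal PyTorch sketch of what CM-MLM over concatenated phoneme posteriors and subword tokens could look like. The module names, encoder layout, and hyperparameters (e.g., phoneme_proj, hidden=768) are illustrative assumptions, not the paper's implementation; CM-CLM would plausibly reuse the same encoder while predicting one modality conditioned on the other left unmasked.

```python
# Minimal sketch of Cross-modal Masked Language Modeling (CM-MLM), assuming a
# BERT-style encoder over the concatenation of phoneme-posterior features and
# subword token embeddings. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CMMLMSketch(nn.Module):
    def __init__(self, num_phonemes=70, vocab_size=30522, hidden=768,
                 layers=6, heads=12, mask_id=103):
        super().__init__()
        # Project phoneme posterior vectors into the shared embedding space.
        self.phoneme_proj = nn.Linear(num_phonemes, hidden)
        self.token_emb = nn.Embedding(vocab_size, hidden)
        # Modality embeddings distinguish speech frames from text tokens.
        self.modality_emb = nn.Embedding(2, hidden)
        enc_layer = nn.TransformerEncoderLayer(hidden, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.mlm_head = nn.Linear(hidden, vocab_size)
        self.mask_id = mask_id

    def forward(self, phoneme_posteriors, token_ids, mlm_mask):
        # phoneme_posteriors: (B, T_s, num_phonemes) softmax outputs from an
        # acoustic model; token_ids: (B, T_t) subword ids; mlm_mask: (B, T_t)
        # booleans marking the text positions to mask and predict.
        speech = self.phoneme_proj(phoneme_posteriors)
        masked_ids = token_ids.masked_fill(mlm_mask, self.mask_id)
        text = self.token_emb(masked_ids)
        speech = speech + self.modality_emb.weight[0]
        text = text + self.modality_emb.weight[1]
        # Joint encoding lets masked text attend to the speech modality,
        # which is what drives the cross-modal alignment.
        h = self.encoder(torch.cat([speech, text], dim=1))
        # Score only the text half; cross-entropy is taken on masked positions.
        logits = self.mlm_head(h[:, speech.size(1):])
        return logits
```

In this reading, the interesting design choice is that the masked subwords can only be recovered by attending to the phoneme-posterior frames, so minimizing the MLM loss forces the encoder to align the two modalities rather than rely on text context alone.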
