论文信息 - Speech Model Pre-training for End-to-End Spoken Language Understanding

Speech Model Pre-training for End-to-End Spoken Language Understanding

Whereas conventional spoken language understanding (SLU) systems map speech to text, and then text to intent, end-to-end SLU systems map speech directly to intent through a single trainable model. Achieving high accuracy with these end-to-end models without a large amount of training data is difficult. We propose a method to reduce the data requirements of end-to-end SLU in which the model is first pre-trained to predict words and phonemes, thus learning good features for SLU. We introduce a new SLU dataset, Fluent Speech Commands, and show that our method improves performance both when the full dataset is used for training and when only a small subset is used. We also describe preliminary experiments to gauge the model's ability to generalize to new phrases not heard during training.

[1] Mitch Weintraub,et al. Acoustic Modeling for Google Home , 2017, INTERSPEECH.

[2] Sebastian Ruder,et al. Universal Language Model Fine-tuning for Text Classification , 2018, ACL.

[3] Pete Warden,et al. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition , 2018, ArXiv.

[4] Chong Wang,et al. Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin , 2015, ICML.

[5] Dilek Z. Hakkani-Tür,et al. Spoken language understanding , 2008, IEEE Signal Processing Magazine.

[6] Brian Kingsbury,et al. Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7] Sanjeev Khudanpur,et al. Investigation of transfer learning for ASR using LF-MMI trained neural networks , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[8] Thomas Wolf,et al. A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks , 2018, AAAI.

[9] Yoshua Bengio,et al. Interpretable Convolutional Filters with SincNet , 2018, ArXiv.

[10] Quoc V. Le,et al. Semi-supervised Sequence Learning , 2015, NIPS.

[11] Morgan Sonderegger,et al. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi , 2017, INTERSPEECH.

[12] Julius Kunze,et al. Transfer Learning for Speech Recognition on a Budget , 2017, Rep4NLP@ACL.

[13] Hugo Van hamme,et al. Capsule Networks for Low Resource Spoken Language Understanding , 2018, INTERSPEECH.

[14] Dong Yu,et al. Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[15] Francesco Caltagirone,et al. Snips Voice Platform: an embedded Spoken Language Understanding system for private-by-design voice interfaces , 2018, ArXiv.

[16] Jeffrey Dean,et al. Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[17] Geoffrey Zweig,et al. Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding , 2015, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[18] Francesco Caltagirone,et al. Spoken Language Understanding on the Edge , 2018, 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing - NeurIPS Edition (EMC2-NIPS).

[19] Thomas Fang Zheng,et al. Transfer learning for speech and language processing , 2015, 2015 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA).

[20] Sanjeev Khudanpur,et al. Librispeech: An ASR corpus based on public domain audio books , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[21] Ming-Wei Chang,et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[22] Geoffrey E. Hinton,et al. ImageNet classification with deep convolutional neural networks , 2012, Commun. ACM.

[23] Giuseppe Riccardi,et al. How may I help you? , 1997, Speech Commun..

[24] Geoffrey E. Hinton,et al. Dynamic Routing Between Capsules , 2017, NIPS.

[25] Yoshua Bengio,et al. Gradient-based learning applied to document recognition , 1998, Proc. IEEE.

[26] Yoshua Bengio,et al. Speaker Recognition from Raw Waveform with SincNet , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[27] Navdeep Jaitly,et al. Towards End-To-End Speech Recognition with Recurrent Neural Networks , 2014, ICML.

[28] Rich Caruana,et al. Multitask Learning , 1998, Encyclopedia of Machine Learning and Data Mining.

[29] Yoshua Bengio,et al. How transferable are features in deep neural networks? , 2014, NIPS.

[30] David Suendermann-Oeft,et al. Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system , 2017, 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).

[31] Florian Metze,et al. Hierarchical Multitask Learning With CTC , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[32] Hagen Soltau,et al. Neural Speech Recognizer: Acoustic-to-Word LSTM Model for Large Vocabulary Speech Recognition , 2016, INTERSPEECH.

[33] Bhuvana Ramabhadran,et al. Direct Acoustics-to-Word Models for English Conversational Speech Recognition , 2017, INTERSPEECH.

[34] Arun Narayanan,et al. From Audio to Semantics: Approaches to End-to-End Spoken Language Understanding , 2018, 2018 IEEE Spoken Language Technology Workshop (SLT).

[35] Andrew Zisserman,et al. Very Deep Convolutional Networks for Large-Scale Image Recognition , 2014, ICLR.

[36] Jürgen Schmidhuber,et al. Sequence Labelling in Structured Domains with Hierarchical Recurrent Neural Networks , 2007, IJCAI.

[37] Sebastian Stüker,et al. Using multi-task learning to improve the performance of acoustic-to-word and conventional hybrid models , 2019, ArXiv.

[38] Yongqiang Wang,et al. Towards End-to-end Spoken Language Understanding , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[39] Srinivas Bangalore,et al. Spoken Language Understanding without Speech Recognition , 2018, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[40] Karen Livescu,et al. Hierarchical Multitask Learning for CTC-based Speech Recognition , 2018, ArXiv.

[41] Alec Radford,et al. Improving Language Understanding by Generative Pre-Training , 2018 .

[42] Gokhan Tur,et al. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech , 2011 .