End-to-End Spoken Language Understanding: Bootstrapping in Low Resource Scenarios

End-to-end Spoken Language Understanding (SLU) systems, which map speech directly to meaning without an intermediate speech-to-text step, are promising in low resource scenarios. They can be more effective when there is not enough labeled data to train reliable speech recognition and language understanding systems, or when running SLU on the edge is preferred over cloud-based services. In this paper, we present an approach for bootstrapping end-to-end SLU in low resource scenarios. We show that incorporating layers extracted from pre-trained acoustic models, instead of using the typical Mel filter bank features, leads to better-performing SLU models. Moreover, layers extracted from a model pre-trained on one language perform well (a) for SLU tasks in a different language and (b) on utterances from speakers with speech disorders.
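
To make the bootstrapping setup concrete, the following is a minimal PyTorch sketch of the general idea: a pre-trained acoustic model is frozen and its per-frame intermediate activations, rather than Mel filter bank features, are fed into a small trainable SLU intent classifier. The interface and names here (the acoustic model's forward call returning per-frame features, feat_dim, n_intents, the GRU encoder) are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class SLUFromPretrainedLayers(nn.Module):
        """Intent classifier that reuses a frozen pre-trained acoustic model
        as its feature extractor (hypothetical sketch, not the paper's model)."""

        def __init__(self, acoustic_model: nn.Module, feat_dim: int, n_intents: int):
            super().__init__()
            # Pre-trained acoustic model, possibly trained on a different language.
            self.acoustic_model = acoustic_model
            for p in self.acoustic_model.parameters():
                p.requires_grad = False  # freeze: reuse its layers as features only
            # Small trainable SLU head on top of the extracted layer activations.
            self.encoder = nn.GRU(feat_dim, 128, batch_first=True, bidirectional=True)
            self.classifier = nn.Linear(2 * 128, n_intents)

        def forward(self, audio: torch.Tensor) -> torch.Tensor:
            # Assumption: the acoustic model exposes per-frame hidden activations
            # of shape (batch, time, feat_dim); the actual hook point depends on
            # which internal layer is extracted.
            with torch.no_grad():
                feats = self.acoustic_model(audio)
            _, h = self.encoder(feats)             # h: (2, batch, 128)
            utt = torch.cat([h[0], h[1]], dim=-1)  # utterance-level summary
            return self.classifier(utt)            # intent logits

Because the acoustic model is frozen, only the lightweight encoder and classifier are trained on the small amount of labeled SLU data, which is what makes this arrangement attractive when labeled data is scarce.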
