Transfer Learning for Low Resource Spoken Language Understanding without Speech-to-Text

Spoken Language Understanding (SLU) without speech-to-text conversion is promising in low-resource scenarios: applications where there is not enough labeled data to train reliable speech recognition and language understanding systems, or where running SLU on the edge is preferred over cloud-based services. In this paper, we present an approach for building SLU systems without speech-to-text conversion in low-resource scenarios using transfer learning. We show that intermediate-layer representations from a pre-trained model outperform the typically used Mel filter bank features. Moreover, representations extracted from a model pre-trained on one language perform well even for SLU tasks in a different language.
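The core idea, replacing Mel filter bank inputs with activations tapped from an intermediate layer of a pre-trained acoustic model, can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the `AcousticEncoder` architecture, layer sizes, and the choice of which layer to tap are all hypothetical placeholders (a real setup would load weights from a model pre-trained on a high-resource language).

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for a pre-trained acoustic encoder (e.g. an ASR
# model trained on a high-resource language); weights are random here,
# whereas in practice they would be loaded from the pre-trained model.
class AcousticEncoder(nn.Module):
    def __init__(self, n_mels=40, hidden=256):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),  # intermediate layer we tap
            nn.Linear(hidden, hidden), nn.ReLU(),
        )

    def forward(self, x):
        return self.layers(x)

def extract_intermediate(encoder, feats, layer_index=3):
    """Capture one hidden layer's activations via a forward hook; these
    replace raw filter bank features as input to the SLU classifier."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["h"] = output.detach()

    handle = encoder.layers[layer_index].register_forward_hook(hook)
    with torch.no_grad():
        encoder(feats)
    handle.remove()
    return captured["h"]

# Usage: a batch of 2 utterances, 100 frames of 40-dim filter bank features.
encoder = AcousticEncoder()
fbank = torch.randn(2, 100, 40)
reps = extract_intermediate(encoder, fbank)  # shape: (2, 100, 256)
```

Because the tapped representations come from a model trained on acoustic data, they can transfer across languages, which is what allows the downstream SLU classifier to work even when the target language differs from the pre-training language.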
