Representation Based Meta-Learning for Few-Shot Spoken Intent Recognition

Spoken intent detection has become a popular way to interact with smart devices. However, such systems are limited to a preset list of intents (terms or commands), which restricts the quick customization of personal devices to new intents. This paper presents a few-shot spoken intent classification approach that learns task-agnostic representations via the meta-learning paradigm. Specifically, we leverage popular representation-based meta-learning methods to build a task-agnostic representation of utterances, which is then passed to a linear classifier for prediction. We evaluate three such approaches under a novel experimental protocol developed on two popular spoken intent classification datasets: Google Commands and Fluent Speech Commands. For 5-shot (1-shot) classification of novel classes, the proposed framework achieves an average classification accuracy of 88.6% (76.3%) on the Google Commands dataset and 78.5% (64.2%) on the Fluent Speech Commands dataset. This performance is comparable to that of conventional supervised classification models trained with abundant samples.
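
To make the "task-agnostic representation plus linear classifier" idea concrete, the following is a minimal sketch of a single 5-way, 5-shot episode. It is not the paper's implementation. It assumes a closed-form ridge-regression head as one possible linear classifier, and it replaces the utterance encoder with synthetic Gaussian embeddings. The names `fit_ridge_head`, `predict`, and `make_episode`, the embedding dimension, and the regularization value `lam` are all hypothetical choices made for illustration.

```python
import numpy as np


def fit_ridge_head(support_emb, support_labels, n_way, lam=1.0):
    """Fit a linear classifier on support embeddings by ridge regression.

    Assumption: one-hot regression targets and no bias term.
    """
    targets = np.eye(n_way)[support_labels]           # (n_support, n_way)
    # Dual (Woodbury) form: W = X^T (X X^T + lam I)^-1 Y.
    # The matrix to invert is only n_support x n_support.
    gram = support_emb @ support_emb.T + lam * np.eye(support_emb.shape[0])
    return support_emb.T @ np.linalg.solve(gram, targets)  # (dim, n_way)


def predict(query_emb, weights):
    """Return the highest-scoring class index for each query embedding."""
    return np.argmax(query_emb @ weights, axis=1)


def make_episode(rng, n_way=5, k_shot=5, n_query=15, dim=64):
    """Synthetic stand-in for encoder outputs: one Gaussian cluster per class."""
    centers = 3.0 * rng.normal(size=(n_way, dim))

    def sample(n_per_class):
        labels = np.repeat(np.arange(n_way), n_per_class)
        emb = centers[labels] + rng.normal(size=(labels.size, dim))
        return emb, labels

    return sample(k_shot), sample(n_query)


rng = np.random.default_rng(0)
(support_emb, support_labels), (query_emb, query_labels) = make_episode(rng)

# Adapt to the novel classes using only the support set, then score queries.
weights = fit_ridge_head(support_emb, support_labels, n_way=5)
accuracy = float(np.mean(predict(query_emb, weights) == query_labels))
print(f"episode accuracy: {accuracy:.3f}")
```

In a real system the embeddings would come from a meta-trained utterance encoder rather than `make_episode`, and the reported accuracy would be averaged over many such episodes drawn from held-out intent classes.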
