NEURAL MODEL REPROGRAMMING WITH SIMILARITY BASED MAPPING FOR LOW-RESOURCE SPOKEN COMMAND CLASSIFICATION

In this study, we propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR) and build an AR-SCR system. The AR procedure modifies the acoustic signals (from the target domain) so that a pretrained SCR model (from the source domain) can be repurposed for the target task. To resolve the label mismatches between the source and target domains and to further improve the stability of AR, we propose a novel similarity-based label mapping technique to align classes. In addition, transfer learning (TL) is combined with the original AR process to improve the model's adaptation capability. We evaluate the proposed AR-SCR system on three low-resource SCR datasets: Arabic, Lithuanian, and dysarthric Mandarin speech. Experimental results show that, with a pretrained acoustic model (AM) trained on a large-scale English dataset, the proposed AR-SCR system outperforms the current state-of-the-art results on the Lithuanian and Arabic speech command datasets while using only a limited amount of training data.
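The two ideas in the abstract, perturbing the target-domain input and aligning classes by similarity, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a padding-plus-additive-perturbation form of reprogramming, assumes that each class is summarized by a single embedding vector, and assumes that every target class is mapped to its k most cosine-similar source classes. The names reprogram_input, similarity_label_mapping, and map_probabilities are hypothetical.

```python
import numpy as np

def reprogram_input(x, delta, source_len):
    """Zero-pad a target signal to the source input length and add a
    trainable perturbation only on the padded positions (assumed form)."""
    padded = np.zeros(source_len)
    padded[: len(x)] = x
    mask = np.ones(source_len)
    mask[: len(x)] = 0.0  # leave the original samples untouched
    return padded + mask * delta

def cosine_similarity_matrix(a, b):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_n @ b_n.T

def similarity_label_mapping(target_emb, source_emb, k):
    """For each target class, return the indices of its k most similar
    source classes. Class embeddings are an assumption of this sketch."""
    sim = cosine_similarity_matrix(target_emb, source_emb)  # (T, S)
    return np.argsort(-sim, axis=1)[:, :k]                  # (T, k)

def map_probabilities(source_probs, mapping):
    """Average the source-class probabilities assigned to each target class.
    source_probs: (batch, S); mapping: (T, k); returns (batch, T)."""
    return source_probs[:, mapping].mean(axis=2)

if __name__ == "__main__":
    # Toy data: 2 target classes, 4 source classes, 2-dimensional embeddings.
    target_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
    source_emb = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.0], [0.0, 0.8]])
    mapping = similarity_label_mapping(target_emb, source_emb, k=2)
    source_probs = np.array([[0.4, 0.1, 0.4, 0.1]])
    print(mapping)
    print(map_probabilities(source_probs, mapping))
```

In this sketch a source class may be assigned to more than one target class. Enforcing a one-to-one assignment would be a separate design choice that the abstract does not specify.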
