Generating Adversarial Samples For Training Wake-up Word Detection Systems Against Confusing Words

Wake-up word detection models are widely deployed in real-world applications, but they suffer severe performance degradation when encountering adversarial samples. In this paper, we discuss the concept of confusing words in adversarial samples: commonly encountered words of various kinds that sound similar to the predefined keyword. To enhance the robustness of the wake-up word detection system against confusing words, we propose several methods to generate adversarial confusing samples that simulate real confusing-word scenarios, in which no real confusing samples are available in the training set. The generated samples include concatenated audio, synthesized data, and partially masked keywords. Moreover, we use a domain-embedding concatenated system to further improve performance. Experimental results show that the adversarial samples generated by our approach improve the system's robustness in both the common scenario and the confusing-word scenario. In addition, we release a confusing-word testing database, HI-MIA-CW, for future research.
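The paper's exact generation procedures are not detailed in the abstract; as a rough illustration, two of the named techniques (partially masked keywords and concatenated audio) can be sketched on raw waveforms as below. The function names, the zero-masking strategy, and the crossfade length are assumptions for the sketch, not the authors' method.

```python
import numpy as np

def mask_keyword_segment(waveform, sample_rate, mask_start_s, mask_dur_s):
    """Zero out a span of a keyword utterance to simulate a partially
    masked keyword (illustrative; masking strategy is an assumption)."""
    out = waveform.copy()
    start = int(mask_start_s * sample_rate)
    end = min(len(out), start + int(mask_dur_s * sample_rate))
    out[start:end] = 0.0
    return out

def concatenate_segments(segments, sample_rate, crossfade_s=0.01):
    """Join audio segments with a short linear crossfade to approximate
    concatenated confusing-word audio (illustrative)."""
    n_fade = int(crossfade_s * sample_rate)
    out = segments[0].astype(np.float64)
    for seg in segments[1:]:
        seg = seg.astype(np.float64)
        if n_fade > 0 and len(out) >= n_fade and len(seg) >= n_fade:
            fade = np.linspace(0.0, 1.0, n_fade)
            # Overlap-add the tail of the running output with the head
            # of the next segment to avoid a hard click at the join.
            overlap = out[-n_fade:] * (1.0 - fade) + seg[:n_fade] * fade
            out = np.concatenate([out[:-n_fade], overlap, seg[n_fade:]])
        else:
            out = np.concatenate([out, seg])
    return out
```

In practice such generated waveforms would be mixed into the negative (non-keyword) training pool so the detector learns to reject near-keyword audio.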
