End-to-End Models with Auditory Attention in Multi-Channel Keyword Spotting

In this paper, we propose an attention-based end-to-end model for multi-channel keyword spotting (KWS) that is trained to optimize the KWS result directly. As a result, our model outperforms a baseline that relies on signal pre-processing techniques on both clean and noisy test data. We also find that multi-task learning yields better performance when the training and test data are similar, while transfer learning and multi-target spectral mapping dramatically enhance robustness to noisy environments. At 0.1 false alarms (FA) per hour, the model with transfer learning and multi-target mapping gains an absolute 30% improvement in wake-up rate on noisy data with an SNR of about -20.
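The abstract does not specify the architecture, so the following is only a minimal, hypothetical sketch of an attention-based multi-channel KWS model with an auxiliary spectral-mapping head for multi-task training. The GRU encoder, soft attention pooling, layer sizes, and loss weighting are illustrative assumptions, not the authors' actual design.

```python
# Hypothetical sketch (not the paper's exact model): a GRU encoder over stacked
# multi-channel filterbank features, soft attention pooling, a sigmoid keyword
# classifier, and an auxiliary clean-spectrum mapping head for multi-task training.
import torch
import torch.nn as nn

class AttentionKWS(nn.Module):
    def __init__(self, n_channels=2, n_mels=40, hidden=128):
        super().__init__()
        # Features from all microphone channels are stacked along the feature axis.
        self.encoder = nn.GRU(n_channels * n_mels, hidden, batch_first=True)
        self.att_score = nn.Linear(hidden, 1)        # one attention score per frame
        self.kws_head = nn.Linear(hidden, 1)         # keyword / non-keyword
        self.map_head = nn.Linear(hidden, n_mels)    # auxiliary clean-spectrum estimate

    def forward(self, x):
        # x: (batch, time, n_channels * n_mels)
        h, _ = self.encoder(x)                              # (batch, time, hidden)
        alpha = torch.softmax(self.att_score(h), dim=1)     # soft attention over frames
        context = (alpha * h).sum(dim=1)                    # (batch, hidden)
        keyword_prob = torch.sigmoid(self.kws_head(context)).squeeze(-1)
        clean_est = self.map_head(h)                        # per-frame spectral estimate
        return keyword_prob, clean_est

# Toy multi-task training step: KWS cross-entropy plus spectral-mapping MSE.
model = AttentionKWS()
feats = torch.randn(8, 100, 2 * 40)        # dummy noisy multi-channel features
labels = torch.randint(0, 2, (8,)).float() # keyword present / absent
clean_ref = torch.randn(8, 100, 40)        # dummy clean-spectrum targets
prob, clean_est = model(feats)
loss = nn.functional.binary_cross_entropy(prob, labels) \
       + 0.1 * nn.functional.mse_loss(clean_est, clean_ref)
loss.backward()
```

In a setup like this, the weight on the spectral-mapping loss (0.1 above) would be a tunable hyperparameter, and the auxiliary head could be dropped at inference time so only the KWS branch runs on device.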
