论文信息 - A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting

A Monaural Speech Enhancement Method for Robust Small-Footprint Keyword Spotting

Robustness against noise is critical for keyword spotting (KWS) in real-world environments. To improve the robustness, a speech enhancement front-end is involved. Instead of treating the speech enhancement as a separated preprocessing before the KWS system, in this study, a pre-trained speech enhancement front-end and a convolutional neural networks (CNNs) based KWS system are concatenated, where a feature transformation block is used to transform the output from the enhancement front-end into the KWS system's input. The whole model is trained jointly, thus the linguistic and other useful information from the KWS system can be back-propagated to the enhancement front-end to improve its performance. To fit the small-footprint device, a novel convolution recurrent network is proposed, which needs fewer parameters and computation and does not degrade performance. Furthermore, by changing the input features from the power spectrogram to Mel-spectrogram, less computation and better performance are obtained. our experimental results demonstrate that the proposed method significantly improves the KWS system with respect to noise robustness.

Hui Zhang | Zhihao Du | Yue Gu | Xueliang Zhang

[1] Jimmy J. Lin,et al. Deep Residual Learning for Small-Footprint Keyword Spotting , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[2] Yongqiang Wang,et al. An investigation of deep neural networks for noise robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[3] Jimmy Ba,et al. Adam: A Method for Stochastic Optimization , 2014, ICLR.

[4] DeLiang Wang,et al. Supervised Speech Separation Based on Deep Learning: An Overview , 2017, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[5] Sergey Ioffe,et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.

[6] DeLiang Wang,et al. Ideal ratio mask estimation using deep neural networks for robust speech recognition , 2013, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing.

[7] DeLiang Wang,et al. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement , 2018, INTERSPEECH.

[8] Lei Xie,et al. Attention-based End-to-End Models for Small-Footprint Keyword Spotting , 2018, INTERSPEECH.

[9] Jimmy J. Lin,et al. Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting , 2017, ArXiv.

[10] DeLiang Wang,et al. On Training Targets for Supervised Speech Separation , 2014, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[11] Pete Warden,et al. Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition , 2018, ArXiv.

[12] Geoffrey E. Hinton,et al. Rectified Linear Units Improve Restricted Boltzmann Machines , 2010, ICML.

[13] Tara N. Sainath,et al. Learning filter banks within a deep neural network framework , 2013, 2013 IEEE Workshop on Automatic Speech Recognition and Understanding.

[14] Jun Du,et al. Robust speech recognition with speech enhanced deep neural networks , 2014, INTERSPEECH.

[15] Tara N. Sainath,et al. Automatic gain control and multi-style training for robust small-footprint keyword spotting with deep neural networks , 2015, 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[16] Zhong-Qiu Wang,et al. A Joint Training Framework for Robust Automatic Speech Recognition , 2016, IEEE/ACM Transactions on Audio, Speech, and Language Processing.

[17] Dong Yu,et al. Text-Dependent Speech Enhancement for Small-Footprint Robust Keyword Detection , 2018, INTERSPEECH.

[18] Tara N. Sainath,et al. Convolutional neural networks for small-footprint keyword spotting , 2015, INTERSPEECH.