Paper information - Recurrent Neural Network Based Small-footprint Wake-up-word Speech Recognition System with a Score Calibration Method

In this paper, we propose a small-footprint wake-up-word speech recognition (WUWSR) system based on a long short-term memory (LSTM) recurrent neural network, and we design a novel back-end score calibration method named modified zero normalization (MZN). First, the LSTM is trained to predict the posterior probabilities of context-dependent states. Next, MZN is applied to transform the posterior probabilities into normalized scores, which are then converted into a confidence score by dynamic programming. Finally, a given wake-up word is recognized according to its confidence score. The WUWSR system can recognize multiple wake-up words, and the wake-up words can be changed flexibly. Because it omits the decoding network, the system can guarantee low latency. Equal error rate (EER) is adopted as the evaluation metric. Experimental results show that the proposed LSTM-based system achieves a 33.33% relative improvement over a baseline system based on a deep feed-forward neural network. Combining the front-end LSTM acoustic model with the back-end MZN method, our WUWSR system achieves a 51.92% relative improvement.
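The scoring pipeline described in the abstract (posteriors, normalization, dynamic programming, threshold decision) can be illustrated with a minimal sketch. The abstract does not give the details of MZN, so the code below substitutes plain zero normalization with background statistics as a stand-in; it is not the authors' method. The function names, the left-to-right alignment constraint, the division by the number of frames, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def zero_normalize(posteriors, bg_mean, bg_std, eps=1e-8):
    """Plain zero normalization (stand-in for MZN, whose details are not given).

    posteriors: (T, N) array of per-frame state posteriors.
    bg_mean, bg_std: (N,) per-state statistics, assumed to be estimated
    on background (non-wake-up-word) speech.
    """
    return (posteriors - bg_mean) / (bg_std + eps)


def keyword_confidence(scores, state_sequence):
    """Dynamic programming over a left-to-right state sequence.

    Finds the best monotonic alignment of the T frames to the states of one
    wake-up word (each frame either stays in a state or advances by one) and
    returns the average per-frame score on that path (assumed definition).
    """
    num_frames = scores.shape[0]
    num_states = len(state_sequence)
    if num_frames < num_states:
        raise ValueError("need at least one frame per keyword state")

    emit = scores[:, state_sequence]              # (T, S) scores per keyword state
    best = np.full((num_frames, num_states), -np.inf)
    best[0, 0] = emit[0, 0]                       # path must start in the first state
    for t in range(1, num_frames):
        for s in range(num_states):
            stay = best[t - 1, s]
            advance = best[t - 1, s - 1] if s > 0 else -np.inf
            best[t, s] = max(stay, advance) + emit[t, s]
    return best[-1, -1] / num_frames              # path must end in the last state


def toy_posteriors(frame_states, num_states=3, peak=0.8):
    """Hypothetical posteriors: the given state dominates each frame."""
    rest = (1.0 - peak) / (num_states - 1)
    post = np.full((len(frame_states), num_states), rest)
    post[np.arange(len(frame_states)), frame_states] = peak
    return post


if __name__ == "__main__":
    bg_mean = np.full(3, 1.0 / 3.0)               # assumed background statistics
    bg_std = np.full(3, 0.2)
    keyword = [0, 1, 2]                           # hypothetical wake-up-word states
    for name, frames in [("match", [0, 0, 1, 1, 2, 2]),
                         ("mismatch", [2, 2, 1, 1, 0, 0])]:
        scores = zero_normalize(toy_posteriors(frames), bg_mean, bg_std)
        print(name, keyword_confidence(scores, keyword))
```

Because the confidence score is computed directly from the frame-level scores, no decoding network is needed; supporting a different wake-up word in this sketch only requires a different `state_sequence`. A detection would be declared when the confidence exceeds a threshold, and sweeping that threshold is what yields the operating point at which false-accept and false-reject rates are equal (the EER).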
