Recurrent Neural Network Based Small-footprint Wake-up-word Speech Recognition System with a Score Calibration Method

In this paper, we propose a small-footprint wake-up-word speech recognition (WUWSR) system based on a long short-term memory (LSTM) recurrent neural network, together with a novel back-end score calibration method named modified zero normalization (MZN). First, the LSTM is trained to predict the posterior probabilities of context-dependent states. Next, MZN transforms these posterior probabilities into normalized scores, which are then converted to a confidence score by dynamic programming. Finally, a wake-up word is recognized according to the confidence score. The system can recognize multiple wake-up words and allows the wake-up words to be changed flexibly. It can also guarantee low latency by omitting the decoding network. Equal error rate (EER) is adopted as the evaluation metric. Experimental results show that the proposed LSTM-based system achieves a 33.33% relative improvement over a baseline system based on a deep feed-forward neural network. Combining the front-end LSTM acoustic model with the back-end MZN method, our WUWSR system achieves a 51.92% relative improvement.
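As an illustration of the evaluation metric, the following is a minimal sketch of how an EER and a relative improvement could be computed from confidence scores. It is not the paper's evaluation code: the function names, the threshold sweep over observed scores, the reading of "relative improvement" as relative EER reduction, and all numeric values are assumptions made for this example.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by trying every observed score as a threshold.

    Assumption: an utterance is accepted as a wake-up word when its
    confidence score is greater than or equal to the threshold.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a true wake-up word scored below the threshold.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        # False alarm: a non-wake-up utterance scored at or above it.
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(frr - far)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (frr + far) / 2
    return eer


def relative_improvement(baseline_eer, system_eer):
    """Relative EER reduction with respect to a baseline (assumed definition)."""
    return (baseline_eer - system_eer) / baseline_eer


# Hypothetical confidence scores, for illustration only.
targets = [0.9, 0.8, 0.7, 0.4]      # utterances containing the wake-up word
nontargets = [0.6, 0.3, 0.2, 0.1]   # utterances without it

eer = equal_error_rate(targets, nontargets)
print(f"EER: {eer:.2%}")
print(f"Relative improvement: {relative_improvement(3.0, 2.0):.2%}")
```

With a finite set of scores the false rejection and false alarm rates rarely coincide exactly, so the sketch reports the average of the two at the threshold where they are closest; interpolating the error curves is a common alternative.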
