Optimize What Matters: Training DNN-Hmm Keyword Spotting Model Using End Metric

Deep Neural Network–Hidden Markov Model (DNN-HMM) based methods have been successfully used for many always-on keyword spotting algorithms that detect a wake word to trigger a device. The DNN predicts the state probabilities of a given speech frame, while HMM decoder combines the DNN predictions of multiple speech frames to compute the keyword detection score. The DNN, in prior methods, is trained independent of the HMM parameters to minimize the cross-entropy loss between the predicted and the ground-truth state probabilities. The mismatch between the DNN training loss (cross-entropy) and the end metric (detection score) is the main source of sub-optimal performance for the keyword spotting task. We address this loss-metric mismatch with a novel end-to-end training strategy that learns the DNN parameters by optimizing for the detection score. To this end, we make the HMM decoder (dynamic programming) differentiable and back-propagate through it to maximize the score for the keyword and minimize the scores for non-keyword speech segments. Our method does not require any change in the model architecture or the inference framework; there-fore, there is no overhead in run-time memory or compute requirements. Moreover, we show significant reduction in false rejection rate (FRR) at the same false trigger experience (> 70% over independent DNN training).

[1]  Takuya Higuchi,et al.  Stacked 1D convolutional networks for end-to-end small footprint voice trigger detection , 2020, INTERSPEECH.

[2]  Hyun-Jin Park,et al.  End-to-end Streaming Keyword Spotting , 2018, ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Erik Marchi,et al.  Efficient Voice Trigger Detection for Low Resource Hardware , 2018, INTERSPEECH.

[4]  Kai Yu,et al.  Sequence Discriminative Training for Deep Learning based Acoustic Keyword Spotting , 2018, Speech Commun..

[5]  Yundong Zhang,et al.  Hello Edge: Keyword Spotting on Microcontrollers , 2017, ArXiv.

[6]  Jimmy J. Lin,et al.  Deep Residual Learning for Small-Footprint Keyword Spotting , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Nikko Strom,et al.  Compressed Time Delay Neural Network for Small-Footprint Keyword Spotting , 2017, INTERSPEECH.

[8]  Ali Farhadi,et al.  YOLO9000: Better, Faster, Stronger , 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[9]  Awni Y. Hannun,et al.  An End-to-End Architecture for Keyword Spotting and Voice Activity Detection , 2016, ArXiv.

[10]  Minjae Lee,et al.  Online Keyword Spotting with a Character-Level Recurrent Neural Network , 2015, ArXiv.

[11]  Ali Farhadi,et al.  You Only Look Once: Unified, Real-Time Object Detection , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[12]  Kaiming He,et al.  Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks , 2015, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[13]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[14]  Georg Heigold,et al.  Small-footprint keyword spotting using deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15]  Björn W. Schuller,et al.  Keyword spotting exploiting Long Short-Term Memory , 2013, Speech Commun..

[16]  Samy Bengio,et al.  Discriminative keyword spotting , 2009, Speech Commun..

[17]  Jürgen Schmidhuber,et al.  An Application of Recurrent Neural Networks to Discriminative Keyword Spotting , 2007, ICANN.

[18]  Jürgen Schmidhuber,et al.  Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , 2006, ICML.

[19]  Lorenzo Rosasco,et al.  Are Loss Functions All the Same? , 2004, Neural Computation.

[20]  Chin-Hui Lee,et al.  Automatic recognition of keywords in unconstrained speech using hidden Markov models , 1990, IEEE Trans. Acoust. Speech Signal Process..

[21]  Richard Rose,et al.  A hidden Markov model based keyword recognition system , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[22]  John S. Bridle,et al.  Alpha-nets: A recurrent 'neural' network architecture with a hidden Markov model interpretation , 1990, Speech Commun..

[23]  W. Russell,et al.  Continuous hidden Markov modeling for speaker-independent word spotting , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[24]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[25]  M. Picheny,et al.  Comparison of Parametric Representation for Monosyllabic Word Recognition in Continuously Spoken Sentences , 2017 .

[26]  Rohit Prabhavalkar,et al.  Compressing deep neural networks using a rank-constrained topology , 2015, INTERSPEECH.

[27]  Tara N. Sainath,et al.  Convolutional neural networks for small-footprint keyword spotting , 2015, INTERSPEECH.

[28]  I-Fan Chen,et al.  A hybrid HMM/DNN approach to keyword spotting of short words , 2013, INTERSPEECH.

[29]  P. Mermelstein,et al.  Distance measures for speech recognition, psychological and instrumental , 1976 .