Tiny-CRNN: Streaming Wakeword Detection in a Low Footprint Setting

In this work, we propose Tiny-CRNN (Tiny Convolutional Recurrent Neural Network) models for wakeword detection and augment them with scaled dot-product attention. Compared to Convolutional Neural Network (CNN) models at a 250k parameter budget, models based on the Tiny-CRNN architecture reduce False Accepts by 25% with a 10% reduction in parameter size. At a 50k parameter budget, they reduce False Accepts by up to 32% with a 75% reduction in parameter size compared to word-level Dense Neural Network (DNN) models. We discuss solutions to the challenging problem of performing inference on streaming audio with this architecture, and we compare its start-end index errors and latency against CNN, DNN, and DNN-HMM models.
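
The abstract names scaled dot-product attention but does not say how it is connected to the recurrent layers. The sketch below shows only the generic operation, softmax(q K^T / sqrt(d)) V, under the assumption that the last recurrent hidden state serves as the query over the per-frame recurrent outputs. The function name, the choice of query, and the sizes T and d are hypothetical illustrations and are not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Generic scaled dot-product attention for a single query vector.

    query:  (d,)     assumed here to be the last recurrent hidden state
    keys:   (T, d)   assumed here to be the per-frame recurrent outputs
    values: (T, d_v) vectors to be averaged with the attention weights
    """
    d = keys.shape[-1]
    # Similarity of the query to every frame, scaled by sqrt(d).
    scores = keys @ query / np.sqrt(d)            # shape (T,)
    # Numerically stable softmax over the time axis.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()             # shape (T,), sums to 1
    # Attention-weighted summary of the frames.
    context = weights @ values                    # shape (d_v,)
    return context, weights

# Hypothetical sizes: 20 frames of 16-dimensional recurrent output.
rng = np.random.default_rng(0)
T, d = 20, 16
hidden = rng.standard_normal((T, d))
context, weights = scaled_dot_product_attention(hidden[-1], hidden, hidden)
```

The context vector has a fixed size regardless of the number of frames T, which is one reason attention of this kind is commonly used to summarize a variable-length window before a classification layer.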
