Deep Residual Learning for Small-Footprint Keyword Spotting

We explore the application of deep residual learning and dilated convolutions to the keyword spotting task, using the recently-released Google Speech Commands Dataset as our benchmark. Our best residual network (ResNet) implementation significantly outperforms Google's previous convolutional neural networks in terms of accuracy. By varying model depth and width, we can achieve compact models that also outperform previous small-footprint variants. To our knowledge, we are the first to examine these approaches for keyword spotting, and our results establish an open-source state-of-the-art reference to support the development of future speech-based interfaces.

[1]  Sercan Ömer Arik,et al.  Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting , 2017, INTERSPEECH.

[2]  Andreas Stolcke,et al.  The Microsoft 2017 Conversational Speech Recognition System , 2017, 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[3]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[4]  Chunlei Zhang,et al.  End-to-End Text-Independent Speaker Verification with Triplet Loss on Short Utterances , 2017, INTERSPEECH.

[5]  Jimmy J. Lin,et al.  Honk: A PyTorch Reimplementation of Convolutional Neural Networks for Keyword Spotting , 2017, ArXiv.

[6]  Georg Heigold,et al.  Small-footprint keyword spotting using deep neural networks , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[7]  Nikko Strom,et al.  Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting , 2016, 2016 IEEE Spoken Language Technology Workshop (SLT).

[8]  Geoffrey Zweig,et al.  The microsoft 2016 conversational speech recognition system , 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[9]  Tara N. Sainath,et al.  Convolutional neural networks for small-footprint keyword spotting , 2015, INTERSPEECH.

[10]  Vladlen Koltun,et al.  Multi-Scale Context Aggregation by Dilated Convolutions , 2015, ICLR.

[11]  Sergey Ioffe,et al.  Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift , 2015, ICML.