Multi-Scale Convolution for Robust Keyword Spotting

We propose a robust, small-footprint keyword spotting system for resource-constrained devices. The small footprint is achieved by using depthwise-separable convolutions in a ResNet framework. Noise robustness is achieved with a multi-scale ensemble of classifiers: each classifier specializes in a different view of the input, while heavy parameter sharing keeps the ensemble as a whole compact. Extensive experiments on the public Google Speech Commands dataset demonstrate the effectiveness of the proposed method.
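
To illustrate why depthwise-separable convolutions reduce model size, the following minimal sketch compares weight counts for a standard convolution and its depthwise-separable counterpart. The 1-D layout, the channel counts, and the kernel size are illustrative assumptions, not values taken from the paper, and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard 1-D convolution with kernel size k (bias omitted)."""
    return c_in * c_out * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights in a depthwise-separable 1-D convolution (bias omitted).

    Depthwise step: one k-tap filter per input channel.
    Pointwise step: a 1x1 convolution that mixes channels.
    """
    depthwise = c_in * k
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
c_in, c_out, k = 64, 64, 9

std = standard_conv_params(c_in, c_out, k)        # 64 * 64 * 9 = 36864
sep = depthwise_separable_params(c_in, c_out, k)  # 576 + 4096 = 4672
ratio = sep / std                                 # equals 1/c_out + 1/k

print(f"standard: {std}, separable: {sep}, ratio: {ratio:.3f}")
```

Under these assumed dimensions the separable layer needs roughly one eighth of the weights; in general the ratio is 1/c_out + 1/k, so the saving grows with wider layers and larger kernels.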
