An Efficient Temporal Model for Small-Footprint Keyword Spotting

Keyword spotting (KWS), an essential part of human-computer interaction, is widely used on mobile devices. Because the hardware resources of these devices are usually limited, a KWS model must have a small memory footprint, yet previous works still need a large number of parameters to achieve high performance. In this work, we propose a context-dependent and compact network for small-footprint KWS. First, to reduce the running time, we apply a sub-sampling technique to a time delay neural network (TDNN), so that hidden activations are computed at only a few time steps. Second, to take full advantage of the global context information in the feature maps, we use a squeeze-and-excitation block to emphasize the most discriminative regions and to distinguish speech from non-speech regions. Finally, we conduct extensive experiments on the publicly available Google Speech Commands dataset and the private Biaobei Chinese Speech Commands dataset. On the public dataset, our method reaches a classification error rate of 3.56% with only 11K parameters and 322K multiplications, achieving state-of-the-art performance with the fewest parameters and multiplications.
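
The two building blocks named above can be illustrated with a minimal NumPy sketch. It is not the network evaluated in this work: the layer sizes, the context offsets (-2, 0, 2), the stride of 3, and the function names `tdnn_layer` and `squeeze_excite` are all assumptions chosen for illustration. The squeeze-and-excitation step is written in its standard channel-wise form (average over time, then a two-layer gate); the gating axis used in the proposed model may differ.

```python
import numpy as np


def tdnn_layer(x, weight, bias, offsets, stride=1):
    """Time-delay layer with optional sub-sampling (illustrative only).

    x:       (T, C_in) input frames
    weight:  (len(offsets), C_in, C_out), one matrix per context offset
    bias:    (C_out,)
    offsets: frame offsets spliced around each centre frame, e.g. (-2, 0, 2)
    stride:  hidden activations are computed only every `stride` frames
    """
    T = x.shape[0]
    start = max(0, -min(offsets))    # first centre with full left context
    stop = T - max(0, max(offsets))  # one past the last centre with full right context
    out = []
    for t in range(start, stop, stride):
        acc = bias.copy()
        for k, off in enumerate(offsets):
            acc = acc + x[t + off] @ weight[k]
        out.append(np.maximum(acc, 0.0))  # ReLU
    return np.stack(out)


def squeeze_excite(h, w1, w2):
    """Standard channel-wise squeeze-and-excitation on (T, C) activations."""
    z = h.mean(axis=0)                    # squeeze: global average over time
    s = np.maximum(z @ w1, 0.0)           # bottleneck layer with ReLU
    g = 1.0 / (1.0 + np.exp(-(s @ w2)))   # excitation: sigmoid gate in (0, 1)
    return h * g                          # rescale every frame channel-wise


rng = np.random.default_rng(0)
x = rng.standard_normal((98, 40))         # 98 frames x 40 features (assumed sizes)
offsets = (-2, 0, 2)
w = rng.standard_normal((len(offsets), 40, 16)) * 0.1
b = np.zeros(16)

h_full = tdnn_layer(x, w, b, offsets, stride=1)  # activation at every valid frame
h_sub = tdnn_layer(x, w, b, offsets, stride=3)   # activation at every third frame

w1 = rng.standard_normal((16, 4)) * 0.1   # reduction to 4 units (assumed)
w2 = rng.standard_normal((4, 16)) * 0.1
y = squeeze_excite(h_sub, w1, w2)
```

With these assumed sizes, the sub-sampled layer evaluates roughly one third of the frames of the dense layer while producing the same values at the frames it keeps, which is where the savings in multiplications come from.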