Energy-Friendly Keyword Spotting System Using Add-Based Convolution

The wake-up keyword of a keyword spotting (KWS) system often represents the brand name of a smart device, so KWS performance is crucial for modern speech-based human-device interaction. An on-device KWS with both high accuracy and low power consumption is therefore desired. We propose a KWS with add-based convolution layers, namely Add TC-ResNet. Add-based convolution paves a new way to reduce the power consumption of a KWS system, as addition is more energy efficient than multiplication at the hardware level. On Google Speech Commands dataset V2, Add TC-ResNet achieves an accuracy of 97.1%, with 99% of multiplication operations replaced by additions. This result is competitive with a state-of-the-art, fully multiplication-based TC-ResNet KWS. We also investigate knowledge distillation and a mixed addition-multiplication design for the proposed KWS, which lead to further performance improvements.
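The abstract does not spell out the add-based operation, so as a rough illustration only: in the AdderNet formulation that this work builds on, the multiply-accumulate of a standard convolution is replaced by a negative L1 distance between each input patch and the filter, which needs only additions and subtractions. Below is a minimal NumPy sketch of such an add-based 1D (temporal) convolution; the function name, shapes, and valid-padding choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def adder_conv1d(x, w):
    """Add-based 1D convolution in the style of AdderNet.

    Each output is the NEGATIVE sum of absolute differences between
    the input patch and the filter, so no multiplications are needed:
        y[co, t] = -sum_{ci,k} |x[ci, t+k] - w[co, ci, k]|

    x: input features, shape (C_in, T)
    w: filters,        shape (C_out, C_in, K)
    returns: output,   shape (C_out, T - K + 1)  (valid padding)
    """
    c_out, c_in, k = w.shape
    _, t = x.shape
    t_out = t - k + 1
    y = np.empty((c_out, t_out))
    for co in range(c_out):
        for i in range(t_out):
            patch = x[:, i:i + k]          # (C_in, K) input window
            # Larger (closer to zero) output means the patch matches
            # the filter better, mimicking a large multiplicative
            # response in an ordinary convolution.
            y[co, i] = -np.abs(patch - w[co]).sum()
    return y

# Tiny usage example on random MFCC-like features (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal((40, 101))    # e.g. 40 MFCC bins x 101 frames
w = rng.standard_normal((64, 40, 9))  # 64 filters, kernel size 9
print(adder_conv1d(x, w).shape)       # (64, 93)
```

Since an L1 response is not a drop-in replacement for a multiplicative one, such networks typically need adapted training (e.g., rescaled gradients and per-layer normalization in AdderNet); the knowledge distillation and mixed addition-multiplication variants mentioned above are ways to recover the remaining accuracy gap.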
