Vocell: A 65-nm Speech-Triggered Wake-Up SoC for 10-$\mu$W Keyword Spotting and Speaker Verification

The use of speech-triggered wake-up interfaces in ubiquitous and mobile devices has grown significantly in recent years. Since these interfaces must always be active, power consumption is one of their primary design metrics. This article presents a complete mixed-signal system-on-chip (SoC) capable of directly interfacing to an analog microphone and performing keyword spotting (KWS) and speaker verification (SV) without any need for further external accesses. Through the use of: 1) an integrated single-chip, digital-friendly design; 2) hardware-aware algorithmic optimization; and 3) memory- and power-optimized accelerators, ultra-low power is achieved while maintaining high accuracy for speech recognition tasks. The 65-nm implementation achieves 18.3-$\mu$W worst-case power consumption, or 10.6 $\mu$W for typical real-time scenarios, $10\times$ below the state of the art (SoA).
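For context, the sketch below (plain Python/NumPy) illustrates a generic keyword-spotting pipeline of the kind such a wake-up SoC accelerates in hardware: framing the microphone signal, extracting log-Mel filterbank features, and scoring them with a small classifier. It is not the Vocell implementation; all sizes, weights, and names (FS, N_FRAME, N_FILT, mel_filterbank, log_mel_features, score_keyword) are hypothetical placeholders chosen for illustration.

```python
# Minimal, illustrative KWS pipeline sketch (NOT the Vocell design):
# framing -> log-Mel filterbank features -> tiny classifier score.
import numpy as np

FS = 16000          # sample rate (Hz), assumed
N_FRAME = 400       # 25 ms analysis window
HOP = 160           # 10 ms hop
N_FFT = 512
N_FILT = 40         # number of Mel-style filters

def mel_filterbank(n_filt=N_FILT, n_fft=N_FFT, fs=FS):
    """Triangular filterbank on the Mel scale (standard construction)."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(1, n_filt + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb

def log_mel_features(x, fb):
    """Frame the signal, take the power spectrum, apply the filterbank, log-compress."""
    n_frames = 1 + (len(x) - N_FRAME) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FRAME] for i in range(n_frames)])
    frames *= np.hamming(N_FRAME)
    spec = np.abs(np.fft.rfft(frames, N_FFT)) ** 2
    return np.log(spec @ fb.T + 1e-10)            # shape: (n_frames, N_FILT)

def score_keyword(feats, W1, b1, W2, b2):
    """Tiny two-layer scorer standing in for the on-chip classifier."""
    h = np.maximum(feats.mean(axis=0) @ W1 + b1, 0.0)   # average-pool + ReLU
    return float(h @ W2 + b2)                            # keyword score (logit)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(FS)                       # 1 s of stand-in audio
    fb = mel_filterbank()
    feats = log_mel_features(audio, fb)
    W1, b1 = rng.standard_normal((N_FILT, 16)) * 0.1, np.zeros(16)
    W2, b2 = rng.standard_normal(16) * 0.1, 0.0
    print("keyword score:", score_keyword(feats, W1, b1, W2, b2))
```

Log-Mel (or MFCC-style) features followed by a small neural classifier are a common software baseline for always-on KWS; a low-power SoC maps the same stages onto dedicated, memory- and power-optimized hardware blocks.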
