A 1μW voice activity detector using analog feature extraction and digital deep neural network

Voice user interfaces (UIs) are highly compelling for wearable and mobile devices. They have the advantage of using compact and ultra-low-power (ULP) input devices (e.g. passive microphones). Together with ULP signal acquisition and processing, voice UIs can give energy-harvesting acoustic sensor nodes and battery-operated devices the sought-after capability of natural interaction with humans. Voice activity detection (VAD), which separates speech from background noise, is a key building block in such voice UIs; for example, it can enable power gating of higher-level speech tasks such as speaker identification and speech recognition [1]. As an always-on block, the VAD must minimize power consumption while maintaining high classification accuracy. Motivated by the high power efficiency of analog signal processing, a VAD system using analog feature extraction (AFE) and a mixed-signal decision tree (DT) classifier was demonstrated in [2]. While it achieved a record-low power of 6μW, the system requires machine-learning-based calibration of the DT thresholds on a chip-to-chip basis due to poorly controlled AFE variation. Moreover, the 7-node DT may deliver inferior classification accuracy, especially at low input SNR and in difficult noise scenarios, compared to more advanced classifiers such as deep neural networks (DNNs) [1,3]. Although the heavy computational load of conventional floating-point DNNs prevents their adoption in embedded systems, the binarized neural networks (BNNs) with binary weights and activations proposed in [4] may pave the way to ULP implementations. In this paper, we present a 1μW VAD system utilizing AFE and a digital BNN classifier with an event-encoding A/D interface. The whole AFE is 9.4x more power-efficient than the prior art [5] and 7.9x more than the state-of-the-art digital filter bank [6], while the BNN consumes only 0.63μW. To avoid costly chip-wise training, a variation-aware Python model of the AFE was created, and the generated features were used for offline BNN training. Measurements show an 84.4%/85.4% mean speech/non-speech hit rate with a 1.88%/4.65% 1-σ spread across 10 dies, all using the same weights, for speech at 10dB SNR with restaurant noise.
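To make the BNN computation concrete, the sketch below shows a binarized fully connected layer in the spirit of [4]: weights and activations are constrained to ±1, so each multiply-accumulate collapses to an XNOR and a popcount in hardware. This is a minimal illustration only; the layer sizes (48 input features, 64 hidden units, two output classes) and the scalar threshold are assumptions for the example, not the architecture reported in this work.

```python
import numpy as np

def binarize(x):
    """Sign binarization to {-1, +1}; zero maps to +1 (convention from [4])."""
    return np.where(x >= 0, 1, -1).astype(np.int32)

def bnn_layer(a_in, w_bin, threshold):
    """One hidden layer of a binarized MLP.

    a_in:      binary activations in {-1, +1}, shape (n_in,)
    w_bin:     binary weights in {-1, +1}, shape (n_out, n_in)
    threshold: per-neuron offset folding in the batch-norm bias learned
               offline (a single scalar here, for simplicity)
    In hardware the dot product reduces to XNOR + popcount; an integer
    matrix-vector product is used here for readability.
    """
    pre_act = w_bin @ a_in               # integer accumulation
    return binarize(pre_act - threshold)

# Toy forward pass: 48 binary features -> 64 hidden units -> 2 classes
rng = np.random.default_rng(0)
features = binarize(rng.standard_normal(48))
w1 = binarize(rng.standard_normal((64, 48)))
w2 = binarize(rng.standard_normal((2, 64)))
hidden = bnn_layer(features, w1, threshold=0)
scores = w2 @ hidden                     # final layer kept integer-valued
is_speech = bool(scores[1] > scores[0])
```

Because every operand is a single bit, such a network needs no multipliers and only narrow integer accumulators, which is what makes the sub-microwatt digital classifier budget plausible.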
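The abstract also hinges on a variation-aware software model of the AFE for generating training features. Below is a minimal sketch of that idea, assuming the AFE behaves like a band-pass filter bank with rectification and frame averaging; the channel count, band edges, and log-normal mismatch magnitudes are hypothetical placeholders, not the paper's measured spreads. Each Monte Carlo call emulates one "virtual chip" with its own component mismatch.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def afe_features(audio, fs, n_channels=16, sigma_fc=0.05, sigma_gain=0.05,
                 frame_len=0.01, rng=None):
    """Toy variation-aware model of a band-pass filter-bank AFE.

    Each call perturbs every channel's center frequency and gain with
    log-normal mismatch, emulating one fabricated die. Features are the
    frame-averaged, full-wave-rectified outputs of each band.
    fs must be high enough that the top band stays below Nyquist
    (e.g. fs = 16000 for the 100-5000 Hz centers used here).
    """
    if rng is None:
        rng = np.random.default_rng()
    fcs = np.geomspace(100, 5000, n_channels)                # nominal centers (Hz)
    fcs = fcs * rng.lognormal(0.0, sigma_fc, n_channels)     # fc mismatch
    gains = rng.lognormal(0.0, sigma_gain, n_channels)       # gain mismatch

    hop = int(frame_len * fs)
    n_frames = len(audio) // hop
    feats = np.empty((n_frames, n_channels))
    for ch, (fc, g) in enumerate(zip(fcs, gains)):
        lo, hi = fc / 2**0.25, fc * 2**0.25                  # half-octave band
        sos = butter(2, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(g * sosfilt(sos, audio))                # full-wave rectify
        feats[:, ch] = env[:n_frames * hop].reshape(n_frames, hop).mean(axis=1)
    return feats
```

Training the BNN on features pooled from many such randomized instances is what would let a single set of offline-trained weights generalize across dies, consistent with the reported use of the same weights on all 10 measured chips and in contrast to the chip-wise calibration required in [2].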

[1] J. R. Glass et al., "A scalable speech recognizer with deep-neural-network acoustic models and voice-activated power gating," ISSCC, 2017.

[2] M. Verhelst et al., "Context-aware hierarchical information-sensing in a 6μW 90nm CMOS voice activity detector," ISSCC, 2015.

[3] D. Wang et al., "Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection," IEEE/ACM Trans. Audio, Speech, and Language Processing, 2016.

[4] R. El-Yaniv et al., "Binarized Neural Networks," arXiv, 2016.

[5] T. Delbrück et al., "A 0.5V 55μW 64×2-channel binaural silicon cochlea for event-driven stereo-audio sensing," ISSCC, 2016.

[6] M. C. Papaefthymiou et al., "A 13.8μW binaural dual-microphone digital ANSI S1.11 filter bank for hearing aids with zero-short-circuit-current logic in 65nm CMOS," ISSCC, 2017.