Voice user interfaces (UIs) are highly compelling for wearable and mobile devices: they use compact, ultra-low-power (ULP) input devices such as passive microphones. Together with ULP signal acquisition and processing, voice UIs can give energy-harvesting acoustic sensor nodes and battery-operated devices the sought-after capability of natural interaction with humans. Voice activity detection (VAD), which separates speech from background noise, is a key building block in such voice UIs; for example, it can enable power gating of higher-level speech tasks such as speaker identification and speech recognition [1]. As an always-on block, the VAD must minimize power consumption while maintaining high classification accuracy. Motivated by the high power efficiency of analog signal processing, a VAD system using analog feature extraction (AFE) and a mixed-signal decision tree (DT) classifier was demonstrated in [2]. While it achieved a record-low power of 6μW, the system requires machine-learning-based calibration of the DT thresholds on a chip-to-chip basis due to poorly controlled AFE variation. Moreover, the 7-node DT may deliver inferior classification accuracy, especially at low input SNR and in difficult noise scenarios, compared with more advanced classifiers such as deep neural networks (DNNs) [1,3]. Although the heavy computational load of conventional floating-point DNNs prevents their adoption in embedded systems, the binarized neural networks (BNNs) with binary weights and activations proposed in [4] may pave the way to ULP implementations. In this paper, we present a 1μW VAD system utilizing AFE and a digital BNN classifier with an event-encoding A/D interface. The whole AFE is 9.4x more power-efficient than the prior art [5] and 7.9x more than the state-of-the-art digital filter bank [6], and the BNN consumes only 0.63μW.
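To illustrate why binary weights and activations make BNNs attractive for ULP hardware, the sketch below shows how a binarized dense layer's multiply-accumulate collapses to an XNOR-popcount on {-1, +1} vectors. This is a minimal illustration of the general BNN arithmetic from [4], not the network used on this chip; all function names here are illustrative.

```python
def binarize(x):
    """Map a real-valued vector to {-1, +1} via the sign function (0 maps to +1)."""
    return [1 if v >= 0 else -1 for v in x]

def bnn_dense(activations, weight_rows):
    """One binarized dense layer: dot products of +/-1 vectors.

    For +/-1 vectors a and w of length n,
        dot(a, w) = (#matches) - (#mismatches) = n - 2 * popcount(a XOR w),
    so hardware needs only XNOR gates and a popcount instead of multipliers.
    """
    a = binarize(activations)
    out = []
    for w in weight_rows:
        matches = sum(1 for ai, wi in zip(a, w) if ai == wi)  # popcount of XNOR
        out.append(matches - (len(a) - matches))  # equals dot(a, w)
    return out

# Example: 4 binarized inputs driving 2 binary-weight neurons
pre_acts = bnn_dense([0.3, -1.2, 0.7, -0.1], [[1, -1, 1, -1], [-1, -1, 1, 1]])
```

The pre-activations would then be binarized again before the next layer, so the entire datapath stays single-bit except for the accumulators.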
To avoid costly chip-wise training, a variation-aware Python model of the AFE was created, and the generated features were used for offline BNN training. Measurements on 10 dies using the same weights show 84.4%/85.4% mean speech/non-speech hit rates with 1.88%/4.65% 1-σ standard deviations for 10dB SNR speech with restaurant noise.
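The variation-aware training idea can be sketched as follows: perturb the nominal AFE features with random per-channel mismatch for each simulated die, so the BNN learns weights that tolerate die-to-die spread, and then summarize measured per-die hit rates by their mean and 1-σ standard deviation. This is an assumed, simplified mechanic for illustration only; the gain-mismatch model, σ value, and example hit rates below are hypothetical, not the authors' actual model or data.

```python
import random
import statistics

def variation_aware_features(nominal_features, gain_sigma=0.1, rng=None):
    """Apply a random per-channel gain mismatch to nominal AFE features,
    emulating one die drawn from the process-variation distribution."""
    rng = rng or random.Random(0)
    gains = [rng.gauss(1.0, gain_sigma) for _ in nominal_features]
    return [g * f for g, f in zip(gains, nominal_features)]

def die_statistics(hit_rates):
    """Mean and 1-sigma (sample) standard deviation of per-die hit rates (%)."""
    return statistics.mean(hit_rates), statistics.stdev(hit_rates)

# Hypothetical speech hit rates measured on 10 dies sharing one weight set
rates = [84.0, 86.1, 83.2, 85.5, 82.9, 84.8, 86.0, 83.7, 84.4, 85.1]
mean_rate, sigma = die_statistics(rates)
```

Training on many such randomly perturbed "virtual dies" is what lets a single offline-trained weight set serve all fabricated chips without per-chip calibration.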
[1] James R. Glass, et al., "14.4 A scalable speech recognizer with deep-neural-network acoustic models and voice-activated power gating," 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017.
[2] Marian Verhelst, et al., "24.2 Context-aware hierarchical information-sensing in a 6μW 90nm CMOS voice activity detector," 2015 IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, 2015.
[3] DeLiang Wang, et al., "Boosting Contextual Information for Deep Neural Network Based Voice Activity Detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2016.
[4] Ran El-Yaniv, et al., "Binarized Neural Networks," arXiv, 2016.
[5] Tobi Delbrück, et al., "22.5 A 0.5V 55μW 64×2-channel binaural silicon cochlea for event-driven stereo-audio sensing," 2016 IEEE International Solid-State Circuits Conference (ISSCC), 2016.
[6] Marios C. Papaefthymiou, et al., "20.7 A 13.8μW binaural dual-microphone digital ANSI S1.11 filter bank for hearing aids with zero-short-circuit-current logic in 65nm CMOS," 2017 IEEE International Solid-State Circuits Conference (ISSCC), 2017.