论文信息 - 9.8 A 25mm2 SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

9.8 A 25mm2 SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET

Automatic speech recognition (ASR) using deep learning is essential for user interfaces on IoT devices. However, previously published ASR chips [4 –7] do not consider realistic operating conditions, which are typically noisy and may include more than one speaker. Furthermore, several of these works have implemented only small-vocabulary tasks, such as keyword-spotting (KWS), where context-blind deep neural network (DNN) algorithms are adequate. However, for large-vocabulary tasks (e.g., $\gt100\mathrm{k}$ words), the more complex bidirectional RNNs with an attention mechanism [1] provide context learning in long sequences, which improve ASR accuracy by up to 62% on the 200kwords LibriSpeech dataset, compared to a simpler unidirectional RNN (Fig. 9.8.1). Attention-based networks emphasize the most relevant parts of the source sequence during each decoding time step. In doing so, the encoder sequence is treated as a soft-addressable memory whose positions are weighted based on the state of the decoder RNN. Bidirectional RNNs learn past and future temporal information by concatenating forward and backward time steps.

[1] Anantha P. Chandrakasan,et al. A Low-Power Speech Recognizer and Voice Activity Detector Using Deep Neural Networks , 2018, IEEE Journal of Solid-State Circuits.

[2] Rob A. Rutenbar,et al. A 3mm2 Programmable Bayesian Inference Accelerator for Unsupervised Machine Perception using Parallel Gibbs Sampling in 16nm , 2020, 2020 IEEE Symposium on VLSI Circuits.

[3] Hugo Van hamme,et al. 18μW SoC for near-microphone Keyword Spotting and Speaker Verification , 2019, 2019 Symposium on VLSI Circuits.

[4] Yoshua Bengio,et al. Neural Machine Translation by Jointly Learning to Align and Translate , 2014, ICLR.

[5] Leibo Liu,et al. A 141 UW, 2.46 PJ/Neuron Binarized Convolutional Neural Network Based Self-Learning Speech Recognition Processor in 28NM CMOS , 2018, 2018 IEEE Symposium on VLSI Circuits.

[6] Rob A. Rutenbar,et al. Stereophonic spectrogram segmentation using Markov random fields , 2012, 2012 IEEE International Workshop on Machine Learning for Signal Processing.

[7] Xi Chen,et al. A 5.1pJ/Neuron 127.3us/Inference RNN-based Speech Recognition Processor using 16 Computing-in-Memory SRAM Macros in 65nm CMOS , 2019, 2019 Symposium on VLSI Circuits.