An 8.93-TOPS/W LSTM Recurrent Neural Network Accelerator Featuring Hierarchical Coarse-Grain Sparsity With All Parameters Stored On-Chip

Long short-term memory (LSTM) networks are widely used in speech applications but are difficult to implement efficiently in hardware because of their large weight storage requirements. We present an energy-efficient LSTM recurrent neural network (RNN) accelerator featuring hierarchical coarse-grain sparsity (HCGS), an algorithm-hardware co-optimized memory compression technique. Aided by HCGS-based block-wise recursive weight compression, we demonstrate LSTM networks with up to $16\times$ fewer weights and minimal accuracy loss. The prototype chip, fabricated in 65-nm LP CMOS, achieves 8.93/7.22 TOPS/W for 2-/3-layer LSTM RNNs trained with HCGS on the TIMIT/TED-LIUM corpora.
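
To make the block-wise recursive compression concrete, the sketch below builds a two-level HCGS mask in NumPy: level 1 keeps a fraction of coarse blocks of the weight matrix, and level 2 keeps a fraction of finer sub-blocks inside each surviving block, so overall density is the product of the two keep ratios (keeping 1/4 at each level yields the $16\times$ reduction quoted above). This is a minimal sketch, not the chip's implementation; the function name `hcgs_mask`, the block sizes, and the purely random block selection are illustrative assumptions (a hardware-friendly variant would constrain how many blocks survive per row group to balance the compute load).

```python
import numpy as np

def hcgs_mask(rows, cols, blk1=64, blk2=16, keep1=0.25, keep2=0.25, rng=None):
    """Two-level hierarchical coarse-grain sparsity mask (illustrative).

    Level 1: tile the matrix into blk1 x blk1 blocks and keep a keep1
    fraction of them per block-row. Level 2: tile each surviving block
    into blk2 x blk2 sub-blocks and keep a keep2 fraction per sub-row.
    Overall density = keep1 * keep2 (0.25 * 0.25 -> 16x fewer weights).
    Assumes rows and cols are multiples of blk1, and blk1 of blk2.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mask = np.zeros((rows, cols), dtype=np.float32)
    n_blk = cols // blk1   # coarse blocks per block-row
    n_sub = blk1 // blk2   # sub-blocks per sub-row within a block
    for r0 in range(0, rows, blk1):
        # Level 1: a random subset of coarse blocks survives in this block-row.
        for b in rng.choice(n_blk, max(1, int(n_blk * keep1)), replace=False):
            c0 = b * blk1
            for r1 in range(r0, r0 + blk1, blk2):
                # Level 2: a random subset of sub-blocks survives per sub-row.
                for s in rng.choice(n_sub, max(1, int(n_sub * keep2)), replace=False):
                    c1 = c0 + s * blk2
                    mask[r1:r1 + blk2, c1:c1 + blk2] = 1.0
    return mask

# The mask is fixed before training and the weights are trained under it,
# so the network adapts to the enforced block structure.
W = np.random.randn(512, 512).astype(np.float32)
W_sparse = W * hcgs_mask(512, 512)  # ~1/16 of the weights remain
```

Because the surviving weights form regular, equally sized blocks, only block indices (rather than per-weight coordinates) need to be stored on-chip, which is what makes this style of structured sparsity cheaper to decode in hardware than unstructured pruning.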
