E-LSTM: An Efficient Hardware Architecture for Long Short-Term Memory

Long Short-Term Memory (LSTM) and its variants have been widely adopted in many sequential learning tasks, such as speech recognition and machine translation. Significant accuracy improvements can be achieved using complex LSTM models with large memory requirements and high computational complexity, which makes them time-consuming and energy-demanding. The low-latency and energy-efficiency requirements of real-world applications make model compression and hardware acceleration for LSTM an urgent need. In this paper, several hardware-efficient network compression schemes are introduced first, including structured top-k pruning, clipped gating, and multiplication-free quantization, which reduce the model size and the number of matrix operations by 32× and 21.6×, respectively, with negligible accuracy loss. Furthermore, efficient hardware architectures for accelerating the compressed LSTM are proposed, which support inference over multiple layers and multiple time steps. The computation process is judiciously reorganized and the memory access pattern is optimized, which alleviates the memory bandwidth bottleneck and enables higher throughput. Moreover, the parallel processing strategy is carefully designed to make full use of the sparsity introduced by pruning and clipped gating while maintaining high hardware utilization efficiency. Implemented on an Intel Arria 10 SX660 FPGA running at 200 MHz, the proposed design achieves 1.4–2.2× higher energy efficiency and requires significantly fewer hardware resources than state-of-the-art LSTM implementations.
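To make the compression schemes concrete, the Python sketch below illustrates the general idea behind structured top-k pruning, multiplication-free (power-of-two) quantization, and clipped gating. It is a minimal illustration, not the paper's implementation: the group size, k, clipping threshold, and quantization details are assumptions chosen for readability.

```python
import numpy as np

def structured_topk_prune(W, group_size=16, k=4):
    # Keep only the k largest-magnitude weights within each group of
    # `group_size` consecutive entries along a row; zero the rest.
    # Structured grouping lets hardware index the surviving weights cheaply.
    W = W.copy()
    rows, cols = W.shape
    assert cols % group_size == 0
    for r in range(rows):
        for g in range(0, cols, group_size):
            block = W[r, g:g + group_size]                  # view into W
            drop = np.argsort(np.abs(block))[:group_size - k]
            block[drop] = 0.0                               # prune the smallest entries
    return W

def pow2_quantize(W):
    # Quantize each nonzero weight to a signed power of two (rounding its
    # log2 exponent), so a multiplication becomes a shift in hardware.
    Wq = np.zeros_like(W)
    nz = W != 0
    exponent = np.round(np.log2(np.abs(W[nz])))
    Wq[nz] = np.sign(W[nz]) * np.exp2(exponent)
    return Wq

def clipped_sigmoid(x, threshold=-3.0):
    # Clipped gating: force a gate to exactly 0 when its pre-activation falls
    # below a threshold, creating dynamic sparsity the accelerator can skip.
    # The threshold value here is an illustrative assumption.
    g = 1.0 / (1.0 + np.exp(-x))
    g[x < threshold] = 0.0
    return g

# Example: compress one LSTM gate weight matrix (hidden = 128, input + hidden = 256).
W = np.random.randn(128, 256).astype(np.float32)
W_sparse = structured_topk_prune(W, group_size=16, k=4)     # 4/16 kept -> 75% sparsity
W_q = pow2_quantize(W_sparse)
print("nonzero weights:", np.count_nonzero(W_q), "of", W.size)
```

In this sketch the static sparsity from pruning and the dynamic sparsity from clipped gating are exactly the properties a hardware scheduler can exploit to skip zero operands, which is what the proposed parallel processing strategy is designed around.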
