Exploiting Model-Level Parallelism in Recurrent Neural Network Accelerators

Recurrent Neural Networks (RNNs) continue to drive rapid progress across a variety of academic and industrial fields, yet their complexity makes efficient deployment difficult: when the RNN model size is not properly matched to the hardware resources, performance suffers from hardware under-utilization. In this work, we explore model-level parallelism for LSTM-RNN accelerators at different levels of the model using a multi-core design. The proposed multi-core design operates in three computing modes: multi-programming mode, in which independent models are executed; multithreading mode, in which parallelism among the layers of an LSTM model is exploited and properly scheduled; and helper-core mode, in which cores collaborate on a single LSTM layer, i.e., at a lower level of the model than in multithreading mode. Our design achieves up to a 1.98x speedup in multi-programming mode, a 1.91x speedup in multithreading mode, and a 1.88x speedup in helper-core mode over the single-core design.
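
To make the distinction between the modes concrete, the following Python/NumPy sketch contrasts a plain single-core LSTM timestep with a row-partitioned version of the same step, which is one plausible way helper-core mode could split a single layer's matrix-vector products across cores. The function names (lstm_step, helper_core_step), the gate ordering, and the row-wise partitioning scheme are illustrative assumptions for exposition, not the accelerator's actual dataflow.

import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, h, c, W, U, b):
    # Single-core reference: one LSTM timestep with gate pre-activations
    # stacked as [input, forget, cell-candidate, output].
    z = W @ x + U @ h + b
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def helper_core_step(x, h, c, W, U, b, n_cores=4):
    # Helper-core mode (illustrative assumption): the gate matrix-vector
    # products of a single layer are split by output rows across n_cores.
    # Each list entry below stands for the partial result one core would
    # compute in parallel; partials are concatenated before the
    # element-wise gate arithmetic.
    W_blocks = np.array_split(W, n_cores, axis=0)
    U_blocks = np.array_split(U, n_cores, axis=0)
    b_blocks = np.array_split(b, n_cores)
    partials = [Wi @ x + Ui @ h + bi
                for Wi, Ui, bi in zip(W_blocks, U_blocks, b_blocks)]
    z = np.concatenate(partials)
    i, f, g, o = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

if __name__ == "__main__":
    D, H = 8, 16                                  # input and hidden sizes
    rng = np.random.default_rng(0)
    W = rng.normal(size=(4 * H, D))
    U = rng.normal(size=(4 * H, H))
    b = np.zeros(4 * H)
    x, h, c = rng.normal(size=D), np.zeros(H), np.zeros(H)
    # The partitioned computation must match the single-core reference.
    assert np.allclose(lstm_step(x, h, c, W, U, b),
                       helper_core_step(x, h, c, W, U, b))

Row-wise partitioning is a natural choice for this kind of collaboration because each output row depends on the full input and hidden-state vectors, so the participating cores need no intermediate communication until the partial results are gathered.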
