Specializing FGPU for Persistent Deep Learning

Overlay architectures enable fast development and debugging on FPGAs, at the cost of potentially lower performance than fully customized FPGA designs. Used alongside a hand-tuned FPGA solution, a performant overlay can improve time-to-solution and thus the overall productivity of FPGA development. In this work, we tune and specialize FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our PDL-FGPU architecture maintains the ease of programming and generality of a software-programmable soft GPU while achieving high performance through specialization for the persistent deep learning (PDL) domain. We also propose an easy method for specializing the overlay to other domains. PDL-FGPU adds new instructions along with microarchitecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA running a set of persistent DL applications (RNN, GRU, LSTM), as well as general non-DL applications to demonstrate generality. PDL-FGPU requires 1.5-3x more ALMs, 4.4-6.4x more M20Ks, and 4.6-10x more DSPs than the FGPU baseline, but improves performance by 55-727x on persistent DL applications with an average 15% degradation on general non-PDL applications. We also demonstrate that PDL-FGPU is only 4-7x slower than the Nvidia Volta V100 GPU.
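For context, the persistent DL workloads evaluated (RNN, GRU, LSTM) all share a sequential recurrence in which the same weight matrices are reused at every timestep; "persistent" implementations keep those weights resident on-chip across the whole sequence. A minimal NumPy sketch of a vanilla RNN step illustrates this structure (illustrative only; the names and toy sizes are assumptions, not from the paper):

```python
import numpy as np

# Illustrative vanilla RNN recurrence: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b).
# In a persistent implementation, W_x and W_h stay resident on-chip for the
# entire sequence; only the input x_t streams in at each timestep.
rng = np.random.default_rng(0)
hidden, feat, steps = 4, 3, 5                 # toy sizes (assumed)
W_x = rng.standard_normal((hidden, feat))     # input weights (persistent)
W_h = rng.standard_normal((hidden, hidden))   # recurrent weights (persistent)
b = np.zeros(hidden)

h = np.zeros(hidden)                          # hidden state
for t in range(steps):
    x_t = rng.standard_normal(feat)           # streaming input for step t
    h = np.tanh(W_x @ x_t + W_h @ h + b)      # sequential dependence on prior h

print(h.shape)
```

The serial dependence of each `h` on the previous one is what limits batching and makes on-chip weight residency attractive for low-latency inference; GRU and LSTM cells follow the same pattern with additional gating matrices.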
