Specializing FGPU for Persistent Deep Learning

Overlay architectures enable fast development and debugging on FPGAs, at the cost of potentially lower performance than fully customized FPGA designs. Used alongside a hand-tuned FPGA solution, a performant overlay can improve time-to-solution and thus the overall productivity of FPGA development. In this work, we tune and specialize FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our PDL-FGPU architecture maintains the ease of programming and generality of a software-programmable soft GPU while achieving high performance through specialization for the persistent deep learning (PDL) domain. We also propose an easy method for specializing the overlay to other domains. PDL-FGPU adds new instructions along with microarchitecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU on a modern high-end Intel Stratix 10 2800 FPGA running a set of persistent DL applications (RNN, GRU, LSTM), as well as general non-DL applications to demonstrate generality. PDL-FGPU requires 1.5-3x more ALMs, 4.4-6.4x more M20Ks, and 4.6-10x more DSPs than the FGPU baseline, but improves performance by 55-727x on persistent DL applications with an average 15% degradation on general non-PDL applications. We also demonstrate that PDL-FGPU is only 4-7x slower than the Nvidia Volta V100 GPU.
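For context, the persistent DL workloads evaluated (RNN, GRU, LSTM) all share a sequential recurrence in which the same weight matrices are reused at every timestep; "persistent" implementations keep those weights resident on-chip across the whole sequence. A minimal NumPy sketch of a vanilla RNN step illustrates this structure (illustrative only; the names and toy sizes are assumptions, not from the paper):

```python
import numpy as np

# Illustrative vanilla RNN recurrence: h_t = tanh(W_x @ x_t + W_h @ h_{t-1} + b).
# In a persistent implementation, W_x and W_h stay resident on-chip for the
# entire sequence; only the input x_t streams in at each timestep.
rng = np.random.default_rng(0)
hidden, feat, steps = 4, 3, 5                 # toy sizes (assumed)
W_x = rng.standard_normal((hidden, feat))     # input weights (persistent)
W_h = rng.standard_normal((hidden, hidden))   # recurrent weights (persistent)
b = np.zeros(hidden)

h = np.zeros(hidden)                          # hidden state
for t in range(steps):
    x_t = rng.standard_normal(feat)           # streaming input for step t
    h = np.tanh(W_x @ x_t + W_h @ h + b)      # sequential dependence on prior h

print(h.shape)
```

The serial dependence of each `h` on the previous one is what limits batching and makes on-chip weight residency attractive for low-latency inference; GRU and LSTM cells follow the same pattern with additional gating matrices.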
