论文信息 - Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs

Why Compete When You Can Work Together: FPGA-ASIC Integration for Persistent RNNs

Interactive intelligent services, such as smart web search, are important datacenter workloads. They rely on dataintensive deep learning (DL) algorithms with strict latency constraints and thus require balancing both data movement and compute capabilities. As such, a persistent approach that keeps the entire DL model on-chip is becoming the new norm for realtime services to avoid the expensive off-chip memory accesses. This approach is adopted in Microsoft's Brainwave and is also provided by Nvidia's cuDNN libraries. This paper presents a comparative study of FPGA, GPU, and FPGA+ASIC in-package solutions for persistent DL. Unlike prior work, we offer a fair and direct comparison targeting common numerical precisions (FP32, INT8) and modern high-end FPGA (Intel® Stratix®10), GPU (Nvidia Volta), and ASIC (10 nm process), all using the persistent approach. We show that Stratix 10 FPGAs offer 2.7× (FP32) to 8.6× (INT8) lower latency than Volta GPUs across RNN, GRU, and LSTM workloads from DeepBench. The GPU can only utilize ~6% of its peak TOPS, while the FPGA with a more balanced on-chip memory and compute can achieve much higher utilization (~57%). We also study integrating an ASIC chiplet, TensorRAM, with an FPGA as system-in-package to enhance on-chip memory capacity and bandwidth, and provide compute throughput matching the required bandwidth. We show that a small 32 mm2 TensorRAM 10nm chiplet can offer 64 MB memory, 32 TB/s on-chiplet bandwidth, and 64 TOPS (INT8). A small Stratix 10 FPGA with a TensorRAM (INT8) offers 15.9× better latency than GPU (FP32) and 34× higher energy efficiency. It has 2× aggregate on-chip memory capacity compared to a large FPGA or GPU. Overall, our study shows that the FPGA is better than the GPU for persistent DL, and when integrated with an ASIC chiplet, it can offer a more compelling solution.

[1] Christoforos E. Kozyrakis,et al. A case for intelligent RAM , 1997, IEEE Micro.

[2] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[3] Vaughn Betz,et al. You Cannot Improve What You Do not Measure , 2018, ACM Trans. Reconfigurable Technol. Syst..

[4] Norbert Wehn,et al. FINN-L: Library Extensions and Design Trade-Off Analysis for Variable Precision LSTM Networks on FPGAs , 2018, 2018 28th International Conference on Field Programmable Logic and Applications (FPL).

[5] Xi Chen,et al. FP-DNN: An Automated Framework for Mapping Deep Neural Networks onto FPGAs with RTL-HLS Hybrid Templates , 2017, 2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM).

[6] David Blaauw,et al. Neural Cache: Bit-Serial In-Cache Acceleration of Deep Neural Networks , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[7] Song Han,et al. ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA , 2016, FPGA.

[8] Aravind Dasu,et al. In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case Study of Accelerating Deep Learning Using TensorTile ASIC , 2018, FPL.

[9] Mohamed S. Abdelfattah,et al. DLA: Compiler and FPGA Overlay for Neural Network Inference Acceleration , 2018, 2018 28th International Conference on Field Programmable Logic and Applications (FPL).

[10] William J. Dally,et al. Fine-Grained DRAM: Energy-Efficient DRAM for Extreme Bandwidth Systems , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[11] Zheng Guo,et al. A 23.6-Mb/mm $^{2}$ SRAM in 10-nm FinFET Technology With Pulsed-pMOS TVC and Stepped-WL for Low-Voltage Applications , 2019, IEEE Journal of Solid-State Circuits.

[12] Aravind Dasu,et al. In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case Study of Accelerating Deep Learning Using TensorTile ASIC , 2018, 2018 28th International Conference on Field Programmable Logic and Applications (FPL).

[13] Eric S. Chung,et al. A Configurable Cloud-Scale DNN Processor for Real-Time AI , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[14] Christoforos E. Kozyrakis,et al. TETRIS: Scalable and Efficient Neural Network Acceleration with 3D Memory , 2017, ASPLOS.

[15] Erich Elsen,et al. Persistent RNNs: Stashing Recurrent Weights On-Chip , 2016, ICML.

[16] Lukasz Kaiser,et al. Attention is All you Need , 2017, NIPS.

[17] Eriko Nurvitadhi,et al. Accelerating recurrent neural networks in analytics servers: Comparison of FPGA, CPU, GPU, and ASIC , 2016, 2016 26th International Conference on Field Programmable Logic and Applications (FPL).

[18] Jeff Pool,et al. Sparse Persistent RNNs: Squeezing Large Recurrent Networks On-Chip , 2018, ICLR.

[19] C. Auth,et al. A 10nm high performance and low-power CMOS technology featuring 3rd generation FinFET transistors, Self-Aligned Quad Patterning, contact over active gate and cobalt local interconnects , 2017, 2017 IEEE International Electron Devices Meeting (IEDM).

[20] Ali Akoglu,et al. A power efficient reconfigurable system-in-stack: 3D integration of accelerators, FPGAs, and DRAM , 2014, 2014 27th IEEE International System-on-Chip Conference (SOCC).

[21] Martin Langhammer,et al. Activation Function Architectures for FPGAs , 2018, 2018 28th International Conference on Field Programmable Logic and Applications (FPL).

[22] Kiyoung Choi,et al. PIM-enabled instructions: A low-overhead, locality-aware processing-in-memory architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[23] Martin Langhammer,et al. Floating-Point DSP Block Architecture for FPGAs , 2015, FPGA.

[24] Norbert Wehn,et al. Hardware architecture of Bidirectional Long Short-Term Memory Neural Network for Optical Character Recognition , 2017, Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017.

[25] Eriko Nurvitadhi,et al. Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Neural Networks? , 2017, FPGA.

[26] George Kurian,et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation , 2016, ArXiv.