Paper Information - Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple’s Siri) and edge computing (e.g., Google/Waymo’s driverless car). FPGA-based DNN accelerators have demonstrated both superior flexibility and performance; in addition, for real-time inference with a low batch size, FPGAs are expected to achieve further performance improvement. However, the performance gain of a single-FPGA design is limited by the on-chip resources available. In this paper, we employ multiple FPGAs to cooperatively run DNNs, with the objective of achieving super-linear speedup over a single-FPGA design. In implementing such systems, we found two barriers to achieving this design goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) the insufficient bandwidth between the off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, “Super-LIP”, which can support different kinds of DNNs. In this paper, we take the Convolutional Neural Network (CNN) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology that alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs can achieve a 3.48× speedup compared to the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, the system latency can be further reduced while maintaining high energy efficiency.
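To make the term "super-linear" concrete, the following minimal Python sketch applies the standard definition of speedup (single-device latency divided by multi-device latency) and checks whether it exceeds the number of devices. The function names are hypothetical and are not part of Super-LIP; only the 3.48× figure and the count of 2 FPGAs come from the abstract.

```python
def speedup(single_latency: float, multi_latency: float) -> float:
    """Speedup = latency on one FPGA / latency on the multi-FPGA system."""
    if single_latency <= 0 or multi_latency <= 0:
        raise ValueError("latencies must be positive")
    return single_latency / multi_latency


def is_super_linear(speedup_value: float, num_fpgas: int) -> bool:
    """A speedup is super-linear when it exceeds the number of devices."""
    return speedup_value > num_fpgas


def parallel_efficiency(speedup_value: float, num_fpgas: int) -> float:
    """Speedup per device; values above 1.0 indicate super-linear scaling."""
    return speedup_value / num_fpgas


# Figures reported in the abstract: 3.48x speedup with 2 FPGAs.
reported_speedup = 3.48
num_fpgas = 2

print(is_super_linear(reported_speedup, num_fpgas))
print(round(parallel_efficiency(reported_speedup, num_fpgas), 2))
```

With the reported figures, the speedup per FPGA is 3.48 / 2 = 1.74, which is above the linear bound of 1.0. The abstract attributes this kind of gain to better layer partitioning and to moving traffic from the memory bus to inter-FPGA links.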
