Achieving Super-Linear Speedup across Multi-FPGA for Real-Time DNN Inference

Real-time Deep Neural Network (DNN) inference with low-latency requirements has become increasingly important for numerous applications in both cloud computing (e.g., Apple's Siri) and edge computing (e.g., Google/Waymo's driverless cars). FPGA-based DNN accelerators have demonstrated superior flexibility and performance; moreover, for real-time inference with small batch sizes, FPGAs are expected to deliver further performance gains. However, the performance of single-FPGA designs is constrained by limited on-chip resources. In this paper, we employ multiple FPGAs to cooperatively run DNNs with the objective of achieving super-linear speedup over single-FPGA designs. In implementing such systems, we identify two barriers that hinder this design goal: (1) the lack of a clear partition scheme for each DNN layer to fully exploit parallelism, and (2) insufficient bandwidth between off-chip memory and the accelerator due to the growing size of DNNs. To tackle these issues, we propose a general framework, "Super-LIP", which can support different kinds of DNNs. In this paper, we take Convolutional Neural Networks (CNNs) as a vehicle to illustrate Super-LIP. We first formulate an accurate system-level model to support the exploration of the best partition schemes. Then, we develop a novel design methodology that alleviates the heavy load on memory bandwidth by moving traffic from the memory bus to inter-FPGA links. We implement Super-LIP on Xilinx ZCU102 FPGA boards. Results demonstrate that Super-LIP with 2 FPGAs achieves a 3.48× speedup over the state-of-the-art single-FPGA design. Moreover, as the number of FPGAs scales up, system latency is further reduced while high energy efficiency is maintained.
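To make the idea of exploring layer partition schemes under compute, memory-bandwidth, and inter-FPGA-link constraints concrete, the sketch below estimates per-layer latency when a convolutional layer's output channels are split evenly across FPGAs. It is a minimal illustration only: the parameter names (macs_per_cycle, ddr_gbps, link_gbps, etc.) and the cost formulas are simplifying assumptions chosen for exposition, not the actual Super-LIP system-level model.

```python
# Hypothetical analytical model for splitting one CONV layer's output channels
# across several FPGAs. All names and formulas are illustrative assumptions,
# not the Super-LIP model from the paper.

from dataclasses import dataclass


@dataclass
class ConvLayer:
    rows: int     # output feature-map height
    cols: int     # output feature-map width
    in_ch: int    # input channels
    out_ch: int   # output channels
    k: int        # kernel size (k x k)


@dataclass
class Fpga:
    macs_per_cycle: int   # usable MAC throughput from on-chip DSPs
    freq_hz: float        # accelerator clock frequency
    ddr_gbps: float       # off-chip memory bandwidth (GB/s)
    link_gbps: float      # inter-FPGA link bandwidth (GB/s)


def layer_latency(layer: ConvLayer, fpga: Fpga, n_fpga: int,
                  bytes_per_word: int = 2) -> float:
    """Rough latency estimate when output channels are split evenly over n_fpga."""
    out_ch_local = -(-layer.out_ch // n_fpga)  # ceiling division
    macs = layer.rows * layer.cols * out_ch_local * layer.in_ch * layer.k ** 2
    t_compute = macs / (fpga.macs_per_cycle * fpga.freq_hz)

    # Each FPGA still reads the full input feature map plus its share of weights.
    ifm_bytes = layer.rows * layer.cols * layer.in_ch * bytes_per_word
    w_bytes = out_ch_local * layer.in_ch * layer.k ** 2 * bytes_per_word
    t_ddr = (ifm_bytes + w_bytes) / (fpga.ddr_gbps * 1e9)

    # Data exchanged with neighbors travels over inter-FPGA links instead of DDR.
    ofm_bytes = layer.rows * layer.cols * out_ch_local * bytes_per_word
    t_link = ofm_bytes / (fpga.link_gbps * 1e9) if n_fpga > 1 else 0.0

    # Assume compute overlaps with data movement; the slowest stage dominates.
    return max(t_compute, t_ddr, t_link)


if __name__ == "__main__":
    layer = ConvLayer(rows=56, cols=56, in_ch=256, out_ch=512, k=3)
    fpga = Fpga(macs_per_cycle=1024, freq_hz=200e6, ddr_gbps=19.2, link_gbps=10.0)
    for n in (1, 2, 4):
        print(f"{n} FPGA(s): {layer_latency(layer, fpga, n) * 1e3:.3f} ms")
```

A design-space explorer in this spirit would sweep candidate partitions of each layer (by output channels, feature-map tiles, or both) and keep the one with the smallest bottleneck stage; the key observation motivating Super-LIP is that shifting traffic from the shared memory bus onto point-to-point inter-FPGA links can shrink that bottleneck faster than the added FPGAs alone would suggest.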
