Scaling the Cascades: Interconnect-Aware FPGA Implementation of Machine Learning Problems

DSP48s, BRAMs, and URAMs in the Xilinx UltraScale+ family support dedicated cascade interconnect for high-frequency, nearest-neighbor data movement over hard wiring resources. We demonstrate how to leverage these interconnect structures to support the data movement requirements of dense machine learning (ML) workloads at the URAM-limited frequency of 650 MHz (714 MHz reported by Vivado). We reformulate convolution and matrix-vector multiplication operations to make effective use of cascade interconnect (1) in DSP48s, to support the common multiply-accumulate chains, and (2) in BRAMs and URAMs, to exploit the data movement and reuse patterns of ML workloads. These dedicated cascade interconnects are an alternative to Versal AI cores, which trade away FPGA flexibility in favor of rigid ASIC components with unproven long-term value. Our 650 MHz operation on the Xilinx VU37P UltraScale+ FPGA is competitive with the 720 MHz state-of-the-art Xilinx SuperTile design. We use 100% of URAM288s, 95% of DSP48s, and 77% of BRAMs, in contrast to the 100% URAM288, 56% DSP48, and 40% BRAM usage of the Xilinx SuperTile array. As a result, we deliver ~7× lower GoogLeNet inference latency while sacrificing 30% of inference throughput relative to their design. For MLPerf benchmarks we observe inference latencies between 2 µs and 1.54 ms, with corresponding throughputs between 645 and 456K inf/s.
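The multiply-accumulate chains mentioned above can be illustrated with a small functional sketch. This is a hypothetical software model, not vendor code: each loop iteration stands in for one DSP48 slice whose PCOUT cascade output feeds the PCIN input of its neighbor, so a chain of slices computes a dot product without touching general-purpose routing.

```python
# Hypothetical functional model of a DSP48 cascade chain (assumption:
# function names and structure are illustrative, not from the paper).

def cascade_dot(weights, activations):
    """Dot product as a chain of multiply-accumulate stages: the partial
    sum travels stage to stage, mirroring the PCIN/PCOUT cascade wires."""
    pcin = 0  # the first DSP48 in the chain has no cascade input
    for w, a in zip(weights, activations):
        pcin = w * a + pcin  # one DSP48 stage: P = A * B + PCIN
    return pcin

def cascade_matvec(matrix, vector):
    """Matrix-vector multiply: one DSP cascade column per output element."""
    return [cascade_dot(row, vector) for row in matrix]

print(cascade_matvec([[1, 2], [3, 4]], [5, 6]))  # → [17, 39]
```

In hardware, the cascade path is registered at every stage, which is what allows the chain to run at the high clock rates discussed above; this flat model only checks the arithmetic.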
