HBM Connect: High-Performance HLS Interconnect for FPGA HBM

With the recent release of High Bandwidth Memory (HBM) based FPGA boards, developers can now exploit unprecedented external memory bandwidth. This allows more memory-bound applications to benefit from FPGA acceleration. However, fully utilizing the available bandwidth is not an easy task. When an application requires multiple processing elements to access multiple HBM channels, we observed a significant drop in the effective bandwidth. The existing high-level synthesis (HLS) programming environment has limitations in producing an efficient communication architecture. To solve this problem, we propose HBM Connect, a high-performance customized interconnect for FPGA HBM boards. We introduce novel HLS-based optimization techniques that increase the throughput of AXI bus masters and switching elements. We also present a high-performance customized crossbar that may replace the built-in crossbar. The effectiveness of HBM Connect is demonstrated on Xilinx's Alveo U280 HBM board. Using bucket sort and merge sort as case studies, we explore several design spaces and identify the design point with the best resource-performance trade-off. The results show that HBM Connect improves the resource-performance metrics by 6.5X-211X.
