Shuhai: Benchmarking High Bandwidth Memory on FPGAs

FPGAs are starting to be equipped with High Bandwidth Memory (HBM) as a way to reduce the memory bandwidth bottleneck encountered in some applications and to give the FPGA more capacity for application state. However, the performance characteristics of HBM are still not well specified, especially in the context of FPGAs. In this paper, we bridge the gap between nominal specifications and actual performance by benchmarking HBM on a state-of-the-art FPGA, namely a Xilinx Alveo U280 featuring a two-stack HBM subsystem. To this end, we propose Shuhai, a benchmarking tool that allows us to demystify the underlying details of HBM on an FPGA. FPGA-based benchmarking should also provide a more accurate picture of HBM than benchmarking on CPUs/GPUs, which are noisier systems due to their complex control logic and cache hierarchies. Because the memory itself is complex, benchmarking it with custom hardware logic inside an FPGA yields more detail as well as accurate and deterministic measurements. We observe that 1) HBM is able to provide up to 425 GB/s of memory bandwidth, and 2) how HBM is used has a significant impact on performance, which demonstrates the importance of unveiling the performance characteristics of HBM in order to select the best approach. Shuhai can be easily generalized to other FPGA boards or other generations of memory, e.g., HBM3 and DDR3. We will make Shuhai open-source to benefit the community.
