High Bandwidth Memory on FPGAs: A Data Analytics Perspective

FPGA-based data processing in datacenters is increasing in popularity due to the demands of modern workloads and the resulting need for specialization in hardware. Driven by this trend, vendors are rapidly adapting reconfigurable devices to suit data- and compute-intensive workloads. The inclusion of High Bandwidth Memory (HBM) in FPGA devices is a recent example. HBM promises to overcome the bandwidth bottleneck often faced by FPGA-based accelerators due to their throughput-oriented design. In this paper, we study the usage and benefits of HBM on FPGAs from a data analytics perspective. We consider three workloads that are often performed in analytics-oriented databases and implement them on an FPGA, showing in which cases they benefit from HBM: range selection, hash join, and stochastic gradient descent (SGD) for linear model training. We integrate our designs into a columnar database (MonetDB) and show the trade-offs arising from the integration related to data movement and partitioning. In certain cases, FPGA+HBM-based solutions are able to surpass the highest performance provided by either a 2-socket POWER9 system or a 14-core Xeon E5 by up to 1.8x (selection), 12.9x (join), and 3.2x (SGD).
