Bonsai: High-Performance Adaptive Merge Tree Sorting

Sorting is a key computational kernel in many big data applications. Most sorting implementations focus on a specific input size, record width, and hardware configuration. This has created a wide array of sorters that are each optimized only for a narrow application domain. In this work we show that merge trees can be implemented on FPGAs to offer state-of-the-art performance over many problem sizes. We introduce a novel merge tree architecture and develop Bonsai, an adaptive sorting solution that takes into consideration the off-chip memory bandwidth and the amount of on-chip resources to optimize sorting time. FPGA programmability allows us to leverage Bonsai to quickly implement the optimal merge tree configuration for any problem size and memory hierarchy. Using Bonsai, we develop a state-of-the-art sorter which specifically targets DRAM-scale sorting on AWS EC2 F1 instances. For array sizes of 4-32 GB, our implementation achieves speedups of at least 2.3x, 1.3x, and 1.2x, and up to 2.5x, 3.7x, and 1.3x, over the best designs on CPUs, FPGAs, and GPUs, respectively. Our design exhibits 3.3x better bandwidth-efficiency compared to the best previous sorting implementations. Finally, we demonstrate that Bonsai can tune our design over a wide range of problem sizes (megabyte to terabyte) and memory hierarchies including DDR DRAMs, high-bandwidth memories (HBMs), and solid-state disks (SSDs).
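For readers unfamiliar with merge trees, the following is a minimal software sketch of the general idea: sorted runs enter at the leaves of a binary tree of two-input mergers, and a single sorted stream leaves at the root. It is not the Bonsai hardware architecture. The names `merge2`, `merge_tree`, and `merge_sort_passes`, and the parameters `leaves` and `run_len`, are hypothetical and chosen only for illustration.

```python
from typing import Iterable, Iterator, List


def merge2(a: Iterable[int], b: Iterable[int]) -> Iterator[int]:
    """Two-input merger: lazily merges two sorted streams into one."""
    ia, ib = iter(a), iter(b)
    x, y = next(ia, None), next(ib, None)
    while x is not None and y is not None:
        if x <= y:
            yield x
            x = next(ia, None)
        else:
            yield y
            y = next(ib, None)
    # Drain whichever input still has elements.
    while x is not None:
        yield x
        x = next(ia, None)
    while y is not None:
        yield y
        y = next(ib, None)


def merge_tree(runs: List[Iterable[int]]) -> Iterator[int]:
    """Binary tree of two-input mergers over the given sorted runs."""
    if not runs:
        return iter(())
    level = [iter(r) for r in runs]
    while len(level) > 1:
        # Pair up neighbouring streams; an odd stream passes through unchanged.
        nxt = [merge2(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])
        level = nxt
    return level[0]


def merge_sort_passes(data: List[int], leaves: int, run_len: int) -> List[int]:
    """Sort by repeatedly passing runs through a merge tree with `leaves` inputs."""
    if leaves < 2 or run_len < 1:
        raise ValueError("leaves must be >= 2 and run_len must be >= 1")
    # Assumption: initial runs are produced by some small presorter.
    runs = [sorted(data[i:i + run_len]) for i in range(0, len(data), run_len)]
    while len(runs) > 1:
        # Each pass merges groups of `leaves` runs into one longer run.
        runs = [list(merge_tree(runs[i:i + leaves])) for i in range(0, len(runs), leaves)]
    return runs[0] if runs else []
```

In this sketch, a tree with more leaves needs fewer passes over the data, while a tree with fewer leaves is cheaper to build. This mirrors, in a simplified way, the trade-off between on-chip resources and off-chip memory traffic that the abstract describes.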
