A Cost-Effective and Scalable Merge Sorter Tree on FPGAs

Sorting is an important computation kernel used in many fields, such as image processing, data compression, and database operations. There have been many attempts to accelerate sorting using FPGAs, most of them based on the merge sort algorithm. Merge sorter trees are tree-structured architectures for large-scale sorting. When a merge sorter tree with K input leaves sorts N elements, the merge phases are performed recursively, so the time complexity is O(N log_K(N)). Hence, increasing the number of input leaves K is an effective way to achieve higher sorting performance. However, the hardware resource usage is O(K), which makes it difficult to efficiently implement a merge sorter tree with many input leaves. Ito et al. have recently proposed an algorithm that reduces the hardware complexity of a merge sorter tree with K input leaves from O(K) to O(log(K)). However, they report evaluation results only for K = 8 and K = 16. In this paper, we propose a cost-effective and scalable merge sorter tree architecture based on their algorithm. We show that our design achieves almost the same performance as the conventional design, whose hardware complexity is O(K). We implement a merge sorter tree with 1,024 input leaves on a Xilinx XC7VX485T-2 FPGA and show that the proposed architecture has 52.4x better logic slice utilization with only 1.31x performance degradation compared with the conventional design. We also succeed in implementing a very large merge sorter tree with 4,096 input leaves, which cannot be implemented using the conventional design. This tree achieves a merging throughput of 149 million 64-bit elements per second while using 1.72% of the slices and 7.48% of the Block RAMs of the FPGA.
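To illustrate why increasing K helps, the following is a minimal software sketch of sorting by repeated K-way merge passes. It is not the hardware architecture proposed in the paper: it only models the number of merge phases, which grows as roughly log_K(N). The function names `merge_pass` and `k_way_merge_sort` are hypothetical and chosen for illustration, and Python's `heapq.merge` stands in for the merge sorter tree.

```python
import heapq


def merge_pass(runs, k):
    """Merge each group of up to k sorted runs into one longer sorted run."""
    return [list(heapq.merge(*runs[i:i + k])) for i in range(0, len(runs), k)]


def k_way_merge_sort(data, k):
    """Sort data with repeated k-way merge passes.

    Returns (sorted_list, number_of_passes). Each pass touches all N
    elements once, so the total work is about N * log_k(N).
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    runs = [[x] for x in data]  # start from sorted runs of length 1
    passes = 0
    while len(runs) > 1:
        runs = merge_pass(runs, k)
        passes += 1
    return (runs[0] if runs else []), passes


if __name__ == "__main__":
    data = list(range(4096, 0, -1))  # 4,096 elements in reverse order
    for k in (2, 16, 1024, 4096):
        result, passes = k_way_merge_sort(data, k)
        assert result == sorted(data)
        print(f"K={k}: {passes} merge pass(es)")
```

In this model, each pass corresponds to streaming all N elements through a tree with K leaves, so a larger K means fewer passes over the data. The sketch says nothing about hardware cost, which is the trade-off the abstract addresses.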
