Paper information - Skew-Oblivious Data Routing for Data Intensive Applications on FPGAs with HLS

Skew-Oblivious Data Routing for Data Intensive Applications on FPGAs with HLS

FPGAs are emerging as computing infrastructure for accelerating applications in datacenters, and high-level synthesis (HLS) tools have been proposed to ease FPGA programming. Even with HLS, irregular data-intensive applications require explicit optimizations. A common one is to use multiple processing elements (PEs), each owning a private BRAM-based buffer, to process multiple data items per cycle. Data routing, which dynamically dispatches data items to their designated PEs, avoids the data replication in buffers that static assignment of data to PEs requires, and hence saves BRAM. However, workload imbalance among PEs severely degrades performance on skewed datasets. In this paper, we propose a skew-oblivious data routing architecture that allocates secondary PEs and schedules them at run time to share the workload of overloaded PEs. In addition, we integrate the proposed architecture into a framework called Ditto to minimize the development effort for applications that require skew handling. We evaluate Ditto on five commonly used applications: histogram building, data partitioning, PageRank, heavy hitter detection, and HyperLogLog. The results demonstrate that the generated implementations are robust to skewed datasets and outperform state-of-the-art designs in both throughput and BRAM usage efficiency.
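The imbalance problem described in the abstract can be illustrated with a small software model. The sketch below is not Ditto's architecture and is not taken from the paper: the modulo routing rule, the greedy helper assignment, and the names `route` and `share_with_secondary` are all assumptions made for illustration. It only shows why one hot key overloads a single PE under plain data routing, and how lending extra PEs to the busiest one could shorten the critical path.

```python
def route(keys, num_pes):
    """Dispatch each key to PE (key mod num_pes) and return the per-PE load.

    Assumption: routing by modulo; a real design may use any hash.
    """
    load = [0] * num_pes
    for k in keys:
        load[k % num_pes] += 1
    return load


def share_with_secondary(load, num_secondary):
    """Toy model of workload sharing (hypothetical, not the paper's scheduler).

    Each secondary PE is lent greedily to the primary PE whose per-worker
    load is currently largest; a PE with h helpers splits its work evenly
    among h + 1 workers. Returns the effective per-PE load.
    """
    helpers = [0] * len(load)
    for _ in range(num_secondary):
        busiest = max(range(len(load)), key=lambda i: load[i] / (helpers[i] + 1))
        helpers[busiest] += 1
    # Ceiling division: the slowest worker of a group bounds its finish time.
    return [-(-l // (h + 1)) for l, h in zip(load, helpers)]


# A skewed input: key 0 accounts for 70 of 100 items.
keys = [0] * 70 + list(range(1, 31))
primary_load = route(keys, num_pes=4)
print(primary_load)                                 # [77, 8, 8, 7]
print(max(share_with_secondary(primary_load, 2)))   # 26
```

In this model the overloaded PE handles 77 of the 100 items on its own, so it alone determines the run time. With two helpers in the toy schedule, the longest per-worker load drops to 26.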
