Paper information - LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

Abstract The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient, high-performance execution of massively parallel applications. At the same time, massive thread-level parallelism on FPGAs raises new design challenges. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path customization. This article presents a novel LLVM-based tool for decoupling memory access from computation when synthesizing massively parallel OpenCL kernels on FPGAs. To enable systematic decoupling, we use the idea of kernel parallelism and implement a new parallelism granularity that breaks kernels down into a separate data-path and memory-path (memory read/write). The two paths work concurrently, overlapping the computation of current threads[1] with the memory access of future threads (memory pre-fetching at large scale). The article also proposes an LLVM-based static analysis that detects decouplable data in order to resolve data dependencies and maximize concurrency across the kernels. Experimental results on eight Rodinia benchmarks on an Intel Stratix V FPGA demonstrate significant performance and energy improvements over the baseline implementation using the Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup with only a 3% increase in resource utilization and a 7% increase in power consumption, which reduces overall energy consumption by more than 40%.
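The decoupling idea in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's LLVM tool or its generated OpenCL. It only mimics the structure the abstract describes: a memory-read stage, a data-path stage, and a memory-write stage that run concurrently and communicate through bounded FIFOs, so the reader can fetch operands for later work-items while the data-path is still computing earlier ones. All names (`memory_read`, `data_path`, `memory_write`, `run_decoupled`, `fifo_depth`) and the placeholder computation `v * v + 1` are assumptions made for illustration.

```python
import queue
import threading

SENTINEL = None  # end-of-stream marker; the illustrative inputs are integers


def memory_read(global_in, to_compute):
    """Memory-path (read): fetch operands ahead of the data-path."""
    for value in global_in:
        to_compute.put(value)  # blocks only when the FIFO is full
    to_compute.put(SENTINEL)


def data_path(to_compute, to_write):
    """Data-path: computation only, with no direct access to 'global memory'."""
    while (value := to_compute.get()) is not SENTINEL:
        to_write.put(value * value + 1)  # placeholder computation (assumption)
    to_write.put(SENTINEL)


def memory_write(to_write, global_out):
    """Memory-path (write): drain results back to 'global memory'."""
    while (result := to_write.get()) is not SENTINEL:
        global_out.append(result)


def run_decoupled(global_in, fifo_depth=4):
    """Run the three stages concurrently, connected by bounded FIFOs."""
    to_compute = queue.Queue(maxsize=fifo_depth)
    to_write = queue.Queue(maxsize=fifo_depth)
    global_out = []
    stages = [
        threading.Thread(target=memory_read, args=(global_in, to_compute)),
        threading.Thread(target=data_path, args=(to_compute, to_write)),
        threading.Thread(target=memory_write, args=(to_write, global_out)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return global_out
```

Calling `run_decoupled([1, 2, 3])` returns the results in input order, because each FIFO has a single producer and a single consumer. The `fifo_depth` parameter stands in for how far the read stage may run ahead of the computation. In a hardware setting that depth would trade on-chip buffer resources against tolerance to memory latency.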

[1] James C. Hoe,et al. Automatic multithreaded pipeline synthesis from transactional datapath specifications , 2010, Design Automation Conference.

[2] Santosh G. Abraham,et al. Effective stream-based and execution-based data prefetching , 2004, ICS '04.

[3] Mehdi Baradaran Tahoori,et al. Energy Efficient Scientific Computing on FPGAs using OpenCL , 2017, FPGA.

[4] Gunar Schirner,et al. DS-DSE: Domain-specific design space exploration for streaming applications , 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[5] David J. Lilja,et al. Data prefetch mechanisms , 2000, CSUR.

[6] David R. Kaeli,et al. Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGA , 2016, SIGARCH Comput. Archit. News.

[7] Andrew C. Ling,et al. An OpenCL(TM) Deep Learning Accelerator on Arria 10 , 2017.

[8] Tao Chen,et al. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9] Peter Y. K. Cheung,et al. Outer Loop Pipelining for Application Specific Datapaths in FPGAs , 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[10] Zhiru Zhang,et al. ElasticFlow: A complexity-effective approach for pipelining irregular loop nests , 2015, 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[11] Hamed Tabkhi,et al. Locality Aware Memory Assignment and Tiling , 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[12] Gunar Schirner,et al. Function-Level Processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC , 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.

[13] Sotirios G. Ziavras,et al. Customized kernel execution on reconfigurable hardware for embedded applications , 2009, Microprocess. Microsystems.

[14] Alexander V. Veidenbaum,et al. Multiple stream tracker: a new hardware stride prefetcher , 2014, Conf. Computing Frontiers.

[15] Satoshi Matsuoka,et al. Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[16] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre , 1990, ISCA 1990.

[17] Seth H. Pugsley,et al. Efficiently prefetching complex address patterns , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18] Vikram S. Adve,et al. LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.

[19] Wei Zhang,et al. A performance analysis framework for optimizing OpenCL applications on FPGAs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[20] Wei Zhang,et al. FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[21] Pingfan Meng,et al. Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK , 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[22] Jason Cong,et al. Bandwidth optimization through on-chip memory restructuring for HLS , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[23] David Bernstein,et al. Compiler techniques for data prefetching on the PowerPC , 1995, PACT.

[24] Robert J. Halstead,et al. Exploring irregular memory accesses on FPGAs , 2011, IA3 '11.

[25] Jianbin Fang,et al. A Comprehensive Performance Comparison of CUDA and OpenCL , 2011, 2011 International Conference on Parallel Processing.

[26] Shankar Balachandran,et al. Hardware prefetchers for emerging parallel applications , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[27] Vijay Janapa Reddi,et al. PIN: a binary instrumentation tool for computer architecture research and education , 2004, WCAE '04.

[28] Martin Margala,et al. High level programming of FPGAs for HPC and data centric applications , 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[29] Hongbo Rong,et al. Single-dimension software pipelining for multi-dimensional loops , 2004.

[30] Robert J. Halstead,et al. Compiled multithreaded data paths on FPGAs for dynamic workloads , 2013, 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).

[31] Ronald Tetzlaff,et al. A new high-speed real-time video processing platform , 2014, 2014 14th International Workshop on Cellular Nanoscale Networks and their Applications (CNNA).

[32] John Wawrzynek,et al. Architectural synthesis of computational pipelines with decoupled memory access , 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[33] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[34] Henry M. Levy,et al. An architecture for software-controlled data prefetching , 1991, ISCA '91.

[35] Yu Wang,et al. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration , 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[36] Hamed Tabkhi,et al. Taxonomy of Spatial Parallelism on FPGAs for Massively Parallel Applications , 2018, 2018 31st IEEE International System-on-Chip Conference (SOCC).

[37] Jungwon Kim,et al. OpenACC to FPGA: A Framework for Directive-Based High-Performance Reconfigurable Computing , 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[38] Jason Helge Anderson,et al. LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems , 2013, TECS.

[39] Michael J. Flynn,et al. Hardware and software cache prefetching techniques for MPEG benchmarks , 2000, IEEE Trans. Circuits Syst. Video Technol.

[40] Wei-Chung Hsu,et al. The performance of runtime data cache prefetching in a dynamic optimization system , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.

[41] Kenta Kasai,et al. Flexible non-binary LDPC decoding on FPGAs , 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[42] Peng Zhang,et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[43] Collin McCurdy,et al. Diagnosis and optimization of application prefetching performance , 2013, ICS '13.

[44] Jason Cong,et al. An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers , 2016, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[45] Doris Chen,et al. Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms , 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).

[46] Torsten Hoefler,et al. Scientific Benchmarking of Parallel Computing Systems: Twelve ways to tell the masses when reporting performance results , 2017.

[47] Jason Cong,et al. Understanding Performance Differences of FPGAs and GPUs: (Abstract Only) , 2018, FPGA.

[48] Jing Li,et al. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network , 2017, FPGA.

[49] Tomasz Kryjak,et al. Real-time hardware–software embedded vision system for ITS smart camera implemented in Zynq SoC , 2018, Journal of Real-Time Image Processing.

[50] Zhiru Zhang,et al. Multithreaded pipeline synthesis for data-parallel kernels , 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[51] David R. Kaeli,et al. Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs , 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.