LLVM-based automation of memory decoupling for OpenCL applications on FPGAs

Abstract The availability of OpenCL High-Level Synthesis (OpenCL-HLS) has made FPGAs an attractive platform for power-efficient, high-performance execution of massively parallel applications. At the same time, massive thread-level parallelism on FPGAs raises new design challenges. One major execution bottleneck is the high number of memory stalls exposed to the data-path, which overshadows the benefits of data-path customization. This article presents a novel LLVM-based tool for decoupling memory access from computation when synthesizing massively parallel OpenCL kernels on FPGAs. To enable systematic decoupling, we use the idea of kernel parallelism and implement a new parallelism granularity, sub-kernel parallelism, that breaks each kernel into a separate data-path and memory-path (memory read/write). The two paths run concurrently so that the computation of current threads[1] overlaps with the memory access of future threads (memory pre-fetching at large scale). The article also proposes an LLVM-based static analysis that detects decouplable data in order to resolve data dependencies and maximize concurrency across the kernels. Experimental results on eight Rodinia benchmarks on an Intel Stratix V FPGA demonstrate significant performance and energy improvements over the baseline implementation using the Intel OpenCL SDK. The proposed sub-kernel parallelism achieves more than 2x speedup with only a 3% increase in resource utilization and a 7% increase in power consumption, which reduces overall energy consumption by more than 40%.
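
The decoupling idea in the abstract can be illustrated with a small software model. The sketch below is not the authors' tool and is not OpenCL: it is a minimal Python analogy, assuming one memory-read stage, one compute stage, and one memory-write stage connected by bounded FIFOs that stand in for OpenCL pipes or channels. All names (`memory_read`, `compute`, `memory_write`, `run_decoupled`, `fifo_depth`) and the arithmetic in the compute stage are hypothetical and chosen only for illustration.

```python
import queue
import threading

SENTINEL = None  # assumed end-of-stream marker, not part of the paper


def memory_read(global_in, to_compute):
    """Memory-path (read): fetches operands for future threads."""
    for tid in range(len(global_in)):
        # put() blocks when the FIFO is full, which bounds how far
        # the reader can run ahead of the data-path.
        to_compute.put((tid, global_in[tid]))
    to_compute.put(SENTINEL)


def compute(to_compute, to_write):
    """Data-path: consumes operands from the FIFO, never from global memory."""
    while True:
        item = to_compute.get()
        if item is SENTINEL:
            to_write.put(SENTINEL)
            return
        tid, value = item
        to_write.put((tid, value * value + 1))  # placeholder computation


def memory_write(to_write, global_out):
    """Memory-path (write): stores results produced by the data-path."""
    while True:
        item = to_write.get()
        if item is SENTINEL:
            return
        tid, result = item
        global_out[tid] = result


def run_decoupled(global_in, fifo_depth=4):
    """Run the three stages concurrently and return the output buffer."""
    global_out = [None] * len(global_in)
    to_compute = queue.Queue(maxsize=fifo_depth)
    to_write = queue.Queue(maxsize=fifo_depth)
    stages = [
        threading.Thread(target=memory_read, args=(global_in, to_compute)),
        threading.Thread(target=compute, args=(to_compute, to_write)),
        threading.Thread(target=memory_write, args=(to_write, global_out)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return global_out


if __name__ == "__main__":
    print(run_decoupled(list(range(8))))
```

In this model the reader can run up to `fifo_depth` work-items ahead of the compute stage, which is the software counterpart of overlapping the computation of current threads with the memory access of future threads. The split only works when the addresses the reader needs do not depend on values produced by the compute stage. Here that holds by construction, whereas the article relies on its static analysis to establish which data can be decoupled.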

[1] James C. Hoe, et al. Automatic multithreaded pipeline synthesis from transactional datapath specifications, 2010, Design Automation Conference.

[2] Santosh G. Abraham, et al. Effective stream-based and execution-based data prefetching, 2004, ICS '04.

[3] Mehdi Baradaran Tahoori, et al. Energy Efficient Scientific Computing on FPGAs using OpenCL, 2017, FPGA.

[4] Gunar Schirner, et al. DS-DSE: Domain-specific design space exploration for streaming applications, 2018, 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[5] David J. Lilja, et al. Data prefetch mechanisms, 2000, CSUR.

[6] David R. Kaeli, et al. Exploring the Efficiency of the OpenCL Pipe Semantic on an FPGA, 2016, SIGARCH Comput. Archit. News.

[7] Andrew C. Ling, et al. An OpenCL(TM) Deep Learning Accelerator on Arria 10, 2017.

[8] Tao Chen, et al. Efficient data supply for hardware accelerators with prefetching and access/execute decoupling, 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[9] Peter Y. K. Cheung, et al. Outer Loop Pipelining for Application Specific Datapaths in FPGAs, 2008, IEEE Transactions on Very Large Scale Integration (VLSI) Systems.

[10] Zhiru Zhang, et al. ElasticFlow: A complexity-effective approach for pipelining irregular loop nests, 2015, 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[11] Hamed Tabkhi, et al. Locality Aware Memory Assignment and Tiling, 2018, 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC).

[12] Gunar Schirner, et al. Function-Level Processor (FLP): Raising efficiency by operating at function granularity for market-oriented MPSoC, 2014, 2014 IEEE 25th International Conference on Application-Specific Systems, Architectures and Processors.

[13] Sotirios G. Ziavras, et al. Customized kernel execution on reconfigurable hardware for embedded applications, 2009, Microprocess. Microsystems.

[14] Alexander V. Veidenbaum, et al. Multiple stream tracker: a new hardware stride prefetcher, 2014, Conf. Computing Frontiers.

[15] Satoshi Matsuoka, et al. Evaluating and Optimizing OpenCL Kernels for High Performance Computing with FPGAs, 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.

[16] Norman P. Jouppi, et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and pre, 1990, ISCA 1990.

[17] Seth H. Pugsley, et al. Efficiently prefetching complex address patterns, 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[18] Vikram S. Adve, et al. LLVM: a compilation framework for lifelong program analysis & transformation, 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004.

[19] Wei Zhang, et al. A performance analysis framework for optimizing OpenCL applications on FPGAs, 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[20] Wei Zhang, et al. FlexCL: An analytical performance model for OpenCL workloads on flexible FPGAs, 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[21] Pingfan Meng, et al. Real-time 3D reconstruction for FPGAs: A case study for evaluating the performance, area, and programmability trade-offs of the Altera OpenCL SDK, 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[22] Jason Cong, et al. Bandwidth optimization through on-chip memory restructuring for HLS, 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[23] David Bernstein, et al. Compiler techniques for data prefetching on the PowerPC, 1995, PACT.

[24] Robert J. Halstead, et al. Exploring irregular memory accesses on FPGAs, 2011, IA3 '11.

[25] Jianbin Fang, et al. A Comprehensive Performance Comparison of CUDA and OpenCL, 2011, 2011 International Conference on Parallel Processing.

[26] Shankar Balachandran, et al. Hardware prefetchers for emerging parallel applications, 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[27] Vijay Janapa Reddi, et al. PIN: a binary instrumentation tool for computer architecture research and education, 2004, WCAE '04.

[28] Martin Margala, et al. High level programming of FPGAs for HPC and data centric applications, 2014, 2014 IEEE High Performance Extreme Computing Conference (HPEC).

[29] Hongbo Rong, et al. Single-dimension software pipelining for multi-dimensional loops, 2004.

[30] Robert J. Halstead, et al. Compiled multithreaded data paths on FPGAs for dynamic workloads, 2013, 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES).

[31] Ronald Tetzlaff, et al. A new high-speed real-time video processing platform, 2014, 2014 14th International Workshop on Cellular Nanoscale Networks and their Applications (CNNA).

[32] John Wawrzynek, et al. Architectural synthesis of computational pipelines with decoupled memory access, 2014, 2014 International Conference on Field-Programmable Technology (FPT).

[33] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[34] Henry M. Levy, et al. An architecture for software-controlled data prefetching, 1991, ISCA '91.

[35] Yu Wang, et al. A Reconfigurable Computing Approach for Efficient and Scalable Parallel Graph Exploration, 2012, 2012 IEEE 23rd International Conference on Application-Specific Systems, Architectures and Processors.

[36] Hamed Tabkhi, et al. Taxonomy of Spatial Parallelism on FPGAs for Massively Parallel Applications, 2018, 2018 31st IEEE International System-on-Chip Conference (SOCC).

[37] Jungwon Kim, et al. OpenACC to FPGA: A Framework for Directive-Based High-Performance Reconfigurable Computing, 2016, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS).

[38] Jason Helge Anderson, et al. LegUp: An open-source high-level synthesis tool for FPGA-based processor/accelerator systems, 2013, TECS.

[39] Michael J. Flynn, et al. Hardware and software cache prefetching techniques for MPEG benchmarks, 2000, IEEE Trans. Circuits Syst. Video Technol.

[40] Wei-Chung Hsu, et al. The performance of runtime data cache prefetching in a dynamic optimization system, 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36.

[41] Kenta Kasai, et al. Flexible non-binary LDPC decoding on FPGAs, 2014, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[42] Peng Zhang, et al. Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs, 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[43] Collin McCurdy, et al. Diagnosis and optimization of application prefetching performance, 2013, ICS '13.

[44] Jason Cong, et al. An Optimal Microarchitecture for Stencil Computation Acceleration Based on Nonuniform Partitioning of Data Reuse Buffers, 2016, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[45] Doris Chen, et al. Fractal video compression in OpenCL: An evaluation of CPUs, GPUs, and FPGAs as acceleration platforms, 2013, 2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC).

[46] Torsten Hoefler, et al. Scientific Benchmarking of Parallel Computing Systems: Twelve ways to tell the masses when reporting performance results, 2017.

[47] Jason Cong, et al. Understanding Performance Differences of FPGAs and GPUs: (Abstract Only), 2018, FPGA.

[48] Jing Li, et al. Improving the Performance of OpenCL-based FPGA Accelerator for Convolutional Neural Network, 2017, FPGA.

[49] Tomasz Kryjak, et al. Real-time hardware–software embedded vision system for ITS smart camera implemented in Zynq SoC, 2018, Journal of Real-Time Image Processing.

[50] Zhiru Zhang, et al. Multithreaded pipeline synthesis for data-parallel kernels, 2014, 2014 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[51] David R. Kaeli, et al. Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs, 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.