Taxonomy of Spatial Parallelism on FPGAs for Massively Parallel Applications

OpenCL for FPGAs has emerged as an attractive solution for realizing massively parallel, compute-intensive applications. It offers a customizable, application-specific datapath while abstracting away the complexity of hardware development. Research on OpenCL for FPGAs is at an early stage, and many aspects, such as how spatial parallelism maps onto OpenCL execution semantics, have not been explored in detail. An in-depth understanding and formalization are required to improve the efficiency of OpenCL code on FPGAs and to exploit its parallelism potential to the fullest. This paper presents a comprehensive study that identifies, analyzes, and categorizes spatial parallelism when mapping OpenCL kernels to FPGAs. It explores the impact of Data-Path (DP) replication and Compute Unit (CU) replication on the performance and power efficiency of OpenCL execution on FPGAs. To this end, the paper proposes a generic taxonomy for classifying spatial parallelism when mapping OpenCL to FPGAs. The taxonomy guides the development of FPGA-aware OpenCL code that can achieve much higher efficiency than a baseline implementation. Our experimental results on an Altera Stratix-V FPGA for eight applications from the Rodinia benchmark suite demonstrate that FPGA-aware OpenCL code achieves average performance improvements of 3.4X, 2.2X, and 2.6X for the SCU-MDP, MCU-SDP, and MCU-MDP versions, respectively, over the SCU-SDP baseline. We also compare performance and power efficiency against an AMD FirePro W7100 GPU. Our results demonstrate that benchmarks with regular execution patterns can outperform the GPU, achieving much higher performance per watt. Finally, OpenCL source-code decisions that exploit spatial parallelism can hide memory access latency and thus yield a higher speedup.
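The four configuration names in the abstract can be read as the cross product of two replication choices. The following minimal sketch illustrates that reading. It assumes that "S" and "M" stand for "single" and "multiple", so that SCU-SDP means one compute unit with one data path. The abstract does not spell this out, and the function name `classify` and its parameters are hypothetical, not taken from the paper.

```python
def classify(num_compute_units: int, num_datapaths: int) -> str:
    """Return a taxonomy label for a kernel configuration.

    Assumption (not stated in the abstract): SCU/MCU mean single/multiple
    compute units and SDP/MDP mean single/multiple data paths per unit.
    """
    if num_compute_units < 1 or num_datapaths < 1:
        raise ValueError("replication factors must be at least 1")
    cu = "SCU" if num_compute_units == 1 else "MCU"
    dp = "SDP" if num_datapaths == 1 else "MDP"
    return f"{cu}-{dp}"


if __name__ == "__main__":
    # Enumerate the four classes named in the abstract.
    for cus, dps in [(1, 1), (1, 4), (2, 1), (2, 4)]:
        print(cus, dps, classify(cus, dps))
```

Under this reading, the baseline corresponds to `classify(1, 1)`, and the three FPGA-aware variants replicate the data path, the compute unit, or both.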
