Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB

In this paper we evaluate the performance and energy efficiency of FPGA and CPU devices for a class of parallel computing applications in which the workload can be distributed so that both devices compute simultaneously, rather than simply offloading work from one to the other. The FPGA is programmed via OpenCL, taking advantage of recently available commercial tools and hardware, while Threading Building Blocks (TBB) orchestrates load distribution and balancing between the FPGA and the multicore CPU. We focus on streaming applications that can be implemented as a pipeline of stages. We present an approach that lets the user specify the mapping of pipeline stages to devices (FPGA, GPU or CPU) and the number of active threads. Using a real streaming application as a case study, we evaluate how these parameters affect performance and energy efficiency on a reference heterogeneous system that includes four different types of computational resources: a quad-core Intel Haswell CPU, an embedded Intel HD6000 GPU, a discrete NVIDIA GPU and an Altera FPGA.
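
The two user-facing parameters described above (the stage-to-device mapping and the number of active threads) can be illustrated with a small sketch. This is not the paper's TBB/OpenCL implementation: it uses Python's standard thread pool as a stand-in, and every name in it (`STAGES`, `run_pipeline`, the `scale_*` and `offset_cpu` functions, the device labels) is hypothetical. The "FPGA" function simply computes the same result on the host, where a real system would enqueue an OpenCL kernel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for per-device implementations of each stage.
# A real "FPGA" entry would launch an OpenCL kernel; here it runs on the host.
def scale_cpu(x):
    return x * 2

def scale_fpga(x):
    return x * 2  # same result, simulated alternative device

def offset_cpu(x):
    return x + 1

# Each pipeline stage lists the devices it has an implementation for.
STAGES = [
    {"name": "scale", "impl": {"CPU": scale_cpu, "FPGA": scale_fpga}},
    {"name": "offset", "impl": {"CPU": offset_cpu}},
]

def run_pipeline(items, mapping, num_threads):
    """Push every item through all stages using num_threads workers.

    mapping: dict from stage name to the device label chosen by the user.
    Returns the results in input order.
    """
    # Resolve the user's mapping to one concrete function per stage.
    chosen = [stage["impl"][mapping[stage["name"]]] for stage in STAGES]

    def process(item):
        # One worker carries an item through every stage in sequence.
        for fn in chosen:
            item = fn(item)
        return item

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map preserves the order of the input items.
        return list(pool.map(process, items))
```

For example, `run_pipeline([1, 2, 3], {"scale": "FPGA", "offset": "CPU"}, num_threads=2)` routes the first stage to the simulated FPGA implementation and the second to the CPU. Varying `mapping` and `num_threads` corresponds, in spirit only, to the parameter space the paper explores.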
