Communication Library to Overlap Computation and Communication for OpenCL Application

User-friendly parallel programming environments, such as CUDA and OpenCL are widely used for accelerators. They provide programmers with useful APIs, but the APIs are still low level primitives. Therefore, in order to apply communication optimization techniques, such as double buffering techniques, programmers have to manually write the programs with the primitives. Manual communication optimization requires programmers to have significant knowledge of both application characteristics and CPU-accelerator architecture. This prevents many application developers from effective utilization of accelerators. In addition, managing communication is a tedious and error-prone task even for expert programmers. Thus, it is necessary to develop a communication system which is highly abstracted but still capable of optimization. For this purpose, this paper proposes an OpenCL based communication library. To maximize performance improvement, the proposed library provides a simple but effective programming interface based on Stream Graph in order to specify an applications communication pattern. We have implemented a prototype system on OpenCL platform and applied it to several image processing applications. Our evaluation shows that the library successfully masks the details of accelerator memory management while it can achieve comparable speedup to manual optimization in which we use existing low level interfaces.

[1]  Hsien-Hsin S. Lee,et al.  Smart Refresh: An Enhanced Memory Controller Design for Reducing Energy in Conventional and 3D Die-Stacked DRAMs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[2]  David I. August,et al.  Automatic CPU-GPU communication management and optimization , 2011, PLDI '11.

[3]  William Thies,et al.  StreamIt: A Language for Streaming Applications , 2002, CC.

[4]  Kevin Skadron,et al.  Scalable parallel programming , 2008, 2008 IEEE Hot Chips 20 Symposium (HCS).

[5]  William Thies,et al.  A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[6]  Rudolf Eigenmann,et al.  OpenMP to GPGPU: a compiler framework for automatic translation and optimization , 2009, PPoPP '09.

[7]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[8]  Satoshi Matsuoka,et al.  Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[9]  Satoshi Matsuoka,et al.  Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[10]  Saturnino Garcia,et al.  Kremlin: rethinking and rebooting gprof for the multicore age , 2011, PLDI '11.

[11]  Junying Chen,et al.  Medical Ultrasound Imaging: To GPU or Not to GPU? , 2011, IEEE Micro.