Static Compilation Analysis for Host-Accelerator Communication Optimization

We present an automatic, static program transformation that schedules and generates efficient memory transfers between a host and its hardware accelerator, addressing a well-known performance bottleneck. Our approach relies on two simple heuristics: perform transfers to the accelerator as early as possible and delay transfers back to the host as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. The generated code avoids redundant communications due to data reuse between kernel executions, and the instructions that initiate transfers are scheduled effectively at compile time. We present experimental results obtained with the PolyBench 2.0 suite, some Rodinia benchmarks, and a real numerical simulation. On a modern GPU, we obtain an average speedup of 4 to 5 over a naive parallelization with Par4All, HMPP, and PGI, and of 3.5 over an OpenMP version running on a 12-core multiprocessor.
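
To illustrate the effect of these two heuristics, here is a minimal CUDA-style sketch of host code before and after such a transformation; the kernel, array names, and sizes are hypothetical and do not come from the paper's benchmarks or from actual Par4All output.

    // Minimal sketch, assuming an iterative kernel whose data are reused
    // across launches. Names (step, N, STEPS) are illustrative only.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N      (1 << 20)
    #define STEPS  100

    // Toy kernel: each step reads d_in and writes d_out.
    __global__ void step(const float *d_in, float *d_out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_out[i] = 0.5f * d_in[i];
    }

    int main(void) {
        float *h = (float *)malloc(N * sizeof(float));
        float *d_in, *d_out;
        cudaMalloc(&d_in,  N * sizeof(float));
        cudaMalloc(&d_out, N * sizeof(float));

        /* Naive schedule: one copy-in and one copy-out per kernel launch.
        for (int t = 0; t < STEPS; ++t) {
            cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);
            step<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
            cudaMemcpy(h, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
        }
        */

        // Transformed schedule: the copy-in is hoisted before the loop (as
        // early as possible) and the copy-out is sunk after it (as late as
        // possible), so data reused between launches stay on the accelerator.
        cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);
        for (int t = 0; t < STEPS; ++t) {
            step<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
            float *tmp = d_in; d_in = d_out; d_out = tmp;  // reuse on device
        }
        cudaMemcpy(h, d_in, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_in); cudaFree(d_out); free(h);
        return 0;
    }

In the transformed schedule the arrays reused across the STEPS kernel launches never travel back and forth between host and accelerator memory, which is the redundant communication the paper's static analysis eliminates.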
