Static Compilation Analysis for Host-Accelerator Communication Optimization

We present an automatic, static program transformation that schedules and generates efficient memory transfers between a host and its hardware accelerator, addressing a well-known performance bottleneck. Our approach relies on two simple heuristics: perform transfers to the accelerator as early as possible and delay transfers back to the host as late as possible. We implemented this transformation as a middle-end compilation pass in the PIPS/Par4All compiler. The generated code avoids redundant communications due to data reuse between kernel executions, and the instructions that initiate transfers are scheduled effectively at compile time. We present experimental results obtained with the PolyBench 2.0 suite, some Rodinia benchmarks, and a real numerical simulation. On a modern GPU, we obtain an average speedup of 4 to 5 over a naive parallelization with Par4All, HMPP, and PGI, and of 3.5 over an OpenMP version running on a 12-core multiprocessor.
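
To illustrate the effect of these two heuristics, here is a minimal CUDA-style sketch of host code before and after such a transformation; the kernel, array names, and sizes are hypothetical and do not come from the paper's benchmarks or from actual Par4All output.

    // Minimal sketch, assuming an iterative kernel whose data are reused
    // across launches. Names (step, N, STEPS) are illustrative only.
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define N      (1 << 20)
    #define STEPS  100

    // Toy kernel: each step reads d_in and writes d_out.
    __global__ void step(const float *d_in, float *d_out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_out[i] = 0.5f * d_in[i];
    }

    int main(void) {
        float *h = (float *)malloc(N * sizeof(float));
        float *d_in, *d_out;
        cudaMalloc(&d_in,  N * sizeof(float));
        cudaMalloc(&d_out, N * sizeof(float));

        /* Naive schedule: one copy-in and one copy-out per kernel launch.
        for (int t = 0; t < STEPS; ++t) {
            cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);
            step<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
            cudaMemcpy(h, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
        }
        */

        // Transformed schedule: the copy-in is hoisted before the loop (as
        // early as possible) and the copy-out is sunk after it (as late as
        // possible), so data reused between launches stay on the accelerator.
        cudaMemcpy(d_in, h, N * sizeof(float), cudaMemcpyHostToDevice);
        for (int t = 0; t < STEPS; ++t) {
            step<<<(N + 255) / 256, 256>>>(d_in, d_out, N);
            float *tmp = d_in; d_in = d_out; d_out = tmp;  // reuse on device
        }
        cudaMemcpy(h, d_in, N * sizeof(float), cudaMemcpyDeviceToHost);

        cudaFree(d_in); cudaFree(d_out); free(h);
        return 0;
    }

In the transformed schedule the arrays reused across the STEPS kernel launches never travel back and forth between host and accelerator memory, which is the redundant communication the paper's static analysis eliminates.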
