Correctly Treating Synchronizations in Compiling Fine-Grained SPMD-Threaded Programs for CPU

Automatic compilation for multiple types of devices is important, especially given the current trend toward heterogeneous computing. This paper concentrates on issues in compiling fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPUs. It points out correctness pitfalls in existing techniques, particularly in their treatment of implicit synchronizations. It then describes a systematic dependence analysis specially designed for handling implicit synchronizations in SPMD-threaded programs. By unveiling the relations between inter-thread data dependences and the correct treatment of synchronizations, it presents a dependence-based solution to the problem. Experiments demonstrate that the proposed techniques resolve the correctness issues in existing compilation techniques and help compilers produce correct and efficient translation results.
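
For context, the sketch below is not taken from the paper; the kernel name, the block size, and the two CPU translations are illustrative assumptions. It shows why the placement of synchronizations is decisive when an SPMD kernel is serialized into thread loops for a CPU: the barrier here is an explicit __syncthreads() for clarity, whereas the paper's focus is on implicit synchronizations, but the underlying correctness issue is the same inter-thread dependence.

```cuda
#define BLOCK 256

// Original SPMD kernel: each thread publishes s[tid], then reads its
// neighbor's element. The barrier orders the cross-thread write/read.
__global__ void rotate(int *out, const int *in) {
    __shared__ int s[BLOCK];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * BLOCK + tid];
    __syncthreads();                                    // synchronization point
    out[blockIdx.x * BLOCK + tid] = s[(tid + 1) % BLOCK];
}

// Naive serialization: one thread loop, the synchronization is ignored.
// "Thread" 0 reads s[1] before "thread" 1 has written it, so the
// inter-thread flow dependence of the original kernel is violated.
void rotate_cpu_wrong(int *out, const int *in, int block_id) {
    int s[BLOCK];
    for (int tid = 0; tid < BLOCK; tid++) {
        s[tid] = in[block_id * BLOCK + tid];
        out[block_id * BLOCK + tid] = s[(tid + 1) % BLOCK];  // stale read
    }
}

// Dependence-respecting serialization: the thread loop is fissioned at the
// synchronization point, so every write completes before any read begins.
void rotate_cpu_correct(int *out, const int *in, int block_id) {
    int s[BLOCK];
    for (int tid = 0; tid < BLOCK; tid++)
        s[tid] = in[block_id * BLOCK + tid];
    for (int tid = 0; tid < BLOCK; tid++)
        out[block_id * BLOCK + tid] = s[(tid + 1) % BLOCK];
}
```

A translation that fissions only at explicit barriers handles this example, but the same dependence pattern can arise across synchronization points that are only implicit in the SPMD execution model, which is where a dependence analysis of the kind the paper proposes is needed to decide where the thread loops must be split.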
