Encapsulated Synchronization and Load-Balance in Heterogeneous Programming

Programming models and techniques to exploit parallelism in accelerators, such as GPUs, differ from those used in traditional parallel models for shared- or distributed-memory systems. Blending different programming models to coordinate and exploit devices with very different characteristics and computational power is a challenge. This paper presents a new extensible framework that encapsulates run-time decisions related to data partitioning, granularity, load balancing, synchronization, and communication for systems including assorted GPUs. The main parallel code thus becomes independent of these details, using internal topology and system information to transparently adapt the computation to the system. The programmer can develop specific functions for each architecture, or use existing specialized library functions for different CPU-core or GPU architectures. The high-level coordination is expressed using a programming model built on top of message passing, providing portability across distributed- or shared-memory systems. We show with an example how to produce a parallel code that runs efficiently on systems ranging from a Beowulf cluster to a machine with mixed GPUs. Our experimental results show how the run-time system, guided by hints about the computational-power ratios of the different devices, can automatically partition and distribute large computations across heterogeneous systems, improving overall performance.
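The abstract mentions that the run-time system uses computational-power ratio hints to partition work across heterogeneous devices. As a minimal illustrative sketch (not the paper's actual API), the core idea of proportional partitioning can be expressed as splitting a 1-D index range into contiguous blocks whose sizes follow the ratio hints; the function name and interface here are assumptions for illustration only:

```python
def partition(total, ratios):
    """Split `total` indices into contiguous blocks proportional to `ratios`.

    Returns a list of (start, size) pairs, one per device. Any elements
    left over from integer rounding are assigned to the fastest device.
    """
    weight = sum(ratios)
    sizes = [total * r // weight for r in ratios]
    # Hand the rounding remainder to the device with the largest ratio.
    fastest = max(range(len(ratios)), key=lambda i: ratios[i])
    sizes[fastest] += total - sum(sizes)
    starts = [sum(sizes[:i]) for i in range(len(sizes))]
    return list(zip(starts, sizes))

# Example: one CPU (ratio 1) and two GPUs (ratios 4 and 3) sharing 1000 rows.
print(partition(1000, [1, 4, 3]))  # → [(0, 125), (125, 500), (625, 375)]
```

In a real heterogeneous run-time, each (start, size) block would be dispatched to the corresponding device-specific function (e.g. a CUDA kernel launch for a GPU block, a multithreaded loop for a CPU block), with the ratios either supplied as programmer hints or measured at run time.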
