Porting irregular reductions on heterogeneous CPU-GPU configurations

Heterogeneous architectures play a significant role in High Performance Computing (HPC) today, driven by the popularity of accelerators such as GPUs and by the trend towards integrating CPUs and GPUs. Developing applications that can use these architectures effectively is a major challenge. In this paper, we focus on one of the dwarfs in the Berkeley view on parallel computing: irregular applications arising from unstructured grids. We consider the problem of executing the irregular reductions that arise in these applications on heterogeneous architectures comprising a multi-core CPU and a GPU. We have developed a Multi-level Partitioning Framework with the following features: 1) it supports GPU execution of irregular reductions even when the dataset size exceeds the size of the device memory, 2) it enables pipelining of the partitioning performed on the CPU with the computations on the GPU, and 3) it supports dynamic distribution of work between the multi-core CPU and the GPU. Our extensive evaluation using two different irregular applications demonstrates the effectiveness of our approach.
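
To make features 1) and 3) more concrete, the following is a minimal sketch, not the paper's implementation. It assumes an edge-list input of `(u, v, w)` triples, splits the reduction array into fixed-size node blocks (standing in for chunks small enough to fit in device memory), and lets two workers labelled "cpu" and "gpu" pull blocks from a shared queue. The function names, the block-based partitioning scheme, and the toy update rule are all hypothetical choices made for illustration; the pipelining of feature 2) is not shown.

```python
import queue
import threading


def partition_by_node_block(edges, num_nodes, nodes_per_part):
    """Group edges by the node block that each endpoint update targets."""
    num_parts = (num_nodes + nodes_per_part - 1) // nodes_per_part
    parts = [[] for _ in range(num_parts)]
    for u, v, w in edges:
        pu, pv = u // nodes_per_part, v // nodes_per_part
        parts[pu].append((u, v, w))
        # An edge that crosses two blocks is stored in both; each block
        # later applies only the update that lands on its own nodes.
        if pv != pu:
            parts[pv].append((u, v, w))
    return parts


def reduce_partition(part_id, part_edges, nodes_per_part, result):
    """Apply the toy reduction (+w on u, -w on v) for one node block."""
    lo = part_id * nodes_per_part
    hi = lo + nodes_per_part
    for u, v, w in part_edges:
        if lo <= u < hi:
            result[u] += w
        if lo <= v < hi:
            result[v] -= w


def run_irregular_reduction(edges, num_nodes, nodes_per_part,
                            workers=("cpu", "gpu")):
    """Distribute partitions dynamically: each worker takes the next one."""
    parts = partition_by_node_block(edges, num_nodes, nodes_per_part)
    work = queue.Queue()
    for part_id, part_edges in enumerate(parts):
        work.put((part_id, part_edges))

    result = [0.0] * num_nodes
    processed_by = {}  # part_id -> name of the worker that handled it

    def worker(name):
        while True:
            try:
                part_id, part_edges = work.get_nowait()
            except queue.Empty:
                return
            # Blocks write disjoint ranges of `result`, so no lock is needed.
            reduce_partition(part_id, part_edges, nodes_per_part, result)
            processed_by[part_id] = name

    threads = [threading.Thread(target=worker, args=(n,)) for n in workers]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result, processed_by


if __name__ == "__main__":
    edges = [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 3.0), (0, 3, 4.0)]
    result, processed_by = run_irregular_reduction(edges, 4, 2)
    print(result)
```

Because each block owns a disjoint slice of the reduction array, the workers never write to the same element. The cost in this sketch is that edges crossing a block boundary are stored and visited twice. Which worker processes which block is decided at run time and may differ between runs.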
