A Pattern Specification and Optimizations Framework for Accelerating Scientific Computations on Heterogeneous Clusters

Clusters with accelerators at each node have emerged as the dominant high-end architecture in recent years. Such systems can be extremely hard to program because of the underlying heterogeneity and the need to exploit parallelism at multiple levels. Thus, easing parallel programming today requires not only high-level programming models, but also ones from which hybrid parallelism can be extracted. In this paper, we focus on the following question: "can simple APIs be developed for several classes of popular scientific applications, to ease application development and yet maintain parallel efficiency, on clusters with accelerators?" We approach this problem by individually considering popular patterns that arise in scientific computations. By developing APIs for generalized reductions, irregular reductions, and stencil computations, we show that several complex scientific applications can be supported. We enable compact specification of these applications (40% of the code size of MPI versions), while also enabling parallelization across nodes and across devices within a node, with work distribution across CPU and GPU cores. We further provide a number of optimizations that scientific programmers normally implement by hand. Our framework compares well against existing MPI applications when scaling across nodes, and against hand-written CUDA applications when executing on a single GPU, yet it can scale further by exploiting all available parallelism simultaneously. On a cluster with 64 GPUs, we achieve speedups between 600 and 1800 over sequential (single CPU core) versions.
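To make the pattern-API claim concrete, the following minimal C++ sketch shows what a generalized-reduction interface of this kind might look like. The names (Histogram, process, combine, generalized_reduce) are hypothetical, introduced here for illustration only and not taken from the paper's actual API; the driver below is sequential, whereas the paper's runtime would partition the input across nodes, CPU cores, and GPU devices.

#include <cstdio>
#include <vector>

// Hypothetical generalized-reduction interface: the user supplies
// (1) how one input element updates a reduction object, and
// (2) how two partial reduction objects are merged. A framework of
// the kind described in the paper is then free to partition the
// input across workers, keeping one private reduction object each.
struct Histogram {
    std::vector<long> bins;
    explicit Histogram(size_t n) : bins(n, 0) {}

    // process(): fold a single input element into this partial result.
    void process(int value) { bins[value % bins.size()]++; }

    // combine(): merge another worker's partial result into this one.
    void combine(const Histogram& other) {
        for (size_t i = 0; i < bins.size(); ++i) bins[i] += other.bins[i];
    }
};

// Sequential stand-in for the parallel runtime: a real implementation
// would split `data` into chunks, run process() on each chunk in
// parallel (on CPU cores and/or GPUs), and combine() the partials.
template <typename R, typename T>
R generalized_reduce(const std::vector<T>& data, R identity) {
    R result = identity;
    for (const T& x : data) result.process(x);
    return result;
}

int main() {
    std::vector<int> data = {3, 1, 4, 1, 5, 9, 2, 6, 5, 3};
    Histogram h = generalized_reduce(data, Histogram(4));
    for (size_t i = 0; i < h.bins.size(); ++i)
        std::printf("bin %zu: %ld\n", i, h.bins[i]);
    return 0;
}

Because combine() is associative, partial reduction objects produced independently on different cores or devices can be merged in any order, which is the property such a framework relies on to distribute work across nodes, CPU cores, and GPUs transparently. The irregular-reduction and stencil APIs would expose analogous element-level and neighborhood-level operations.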
