Using the High Productivity Language Chapel to Target GPGPU Architectures

It has been widely shown that GPGPU architectures offer large performance gains over their traditional CPU counterparts for many applications. The downside of these architectures is that current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and difficult performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenges of programming for parallelism and locality. This approach also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory systems, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite that demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.
