XcalableACC: Extension of XcalableMP PGAS Language Using OpenACC for Accelerator Clusters

This paper introduces XcalableACC (XACC), a hybrid programming model that combines the XcalableMP (XMP) Partitioned Global Address Space (PGAS) language with OpenACC. XACC defines directives that let programmers mix XMP and OpenACC directives, making it easy to develop applications for accelerator clusters. Moreover, to improve the performance of stencil applications, the Omni XACC compiler provides functions that transfer halo regions held in accelerator memory via Tightly Coupled Accelerators (TCA), a proprietary network for transferring data directly among accelerators. We evaluate the productivity and performance of XACC through implementations of the Himeno Benchmark. The results show that XACC requires less than half the source lines of code of a typical implementation that combines the Message Passing Interface (MPI) with OpenACC. In terms of performance, XACC with TCA achieves up to 2.7 times the performance of an MPI and OpenACC implementation that uses GPUDirect RDMA over InfiniBand.
