Exploring OpenSHMEM Model to Program GPU-based Extreme-Scale Systems

Extreme-scale systems with compute accelerators such as Graphics Processing Units (GPUs) have become popular platforms for executing scientific applications. These systems are typically programmed using MPI together with CUDA for NVIDIA GPUs. However, the MPI+CUDA approach has several drawbacks: the orchestration required between the compute and communication phases of an application, and the constraint that communication can be initiated only from serial portions of the code running on the Central Processing Unit (CPU), lead to scaling bottlenecks. To address these drawbacks, we explore the viability of using OpenSHMEM for programming these systems. In this paper, we first make the case for supporting GPU-initiated communication and for the suitability of the OpenSHMEM programming model. Second, we present NVSHMEM, a prototype implementation of the proposed programming approach; port the Stencil and Transpose benchmarks, which are representative of many scientific applications, from the MPI+CUDA model to OpenSHMEM; and evaluate the design and implementation of NVSHMEM. Finally, we discuss the opportunities and challenges of using OpenSHMEM to program these systems, and propose extensions to OpenSHMEM that would realize the full potential of this programming approach.
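To illustrate the idea of GPU-initiated communication, the following minimal sketch shows an OpenSHMEM-style put issued from inside a CUDA kernel. It is written against the API of the publicly released NVSHMEM library (nvshmem_init, nvshmem_malloc, nvshmem_float_p, nvshmem_barrier_all), which may differ from the prototype described in this paper; the kernel name, buffer, and ring-exchange pattern are purely illustrative and not taken from the paper.

    // Sketch (assumed API, modeled on public NVSHMEM): each PE's GPU threads
    // write one element each into a neighboring PE's symmetric buffer,
    // directly from device code, with no CPU involvement on the data path.
    #include <cuda_runtime.h>
    #include <nvshmem.h>
    #include <nvshmemx.h>
    #include <cstdio>

    __global__ void exchange(float *remote_buf, int n, int peer) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            // GPU-initiated communication: a single-element put issued by the
            // same thread that produced the value.
            nvshmem_float_p(remote_buf + i, (float)i, peer);
        }
    }

    int main() {
        nvshmem_init();                         // one PE per GPU
        int mype = nvshmem_my_pe();
        int npes = nvshmem_n_pes();
        int peer = (mype + 1) % npes;           // simple ring exchange

        const int n = 1024;
        // Allocation on the symmetric heap, remotely accessible from any PE
        float *buf = (float *)nvshmem_malloc(n * sizeof(float));

        exchange<<<(n + 255) / 256, 256>>>(buf, n, peer);
        cudaDeviceSynchronize();
        nvshmem_barrier_all();                  // complete all outstanding puts

        float first;
        cudaMemcpy(&first, buf, sizeof(float), cudaMemcpyDeviceToHost);
        printf("PE %d received buf[0] = %f from PE %d\n",
               mype, first, (mype + npes - 1) % npes);

        nvshmem_free(buf);
        nvshmem_finalize();
        return 0;
    }

Contrast this with MPI+CUDA, where the kernel would first have to finish, control would return to the CPU, and only then could MPI_Send/MPI_Recv move the data; it is exactly this CPU-side orchestration that the abstract identifies as a scaling bottleneck.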
