Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects

Accelerators (such as NVIDIA GPUs) and coprocessors (such as the Intel MIC/Xeon Phi) are fueling the growth of next-generation ultra-scale systems with high compute density and high performance per watt. However, they render these systems heterogeneous by introducing multiple levels of parallelism and varying computation/communication costs at each level. Application developers use a hierarchy of programming models to extract maximum performance from these heterogeneous systems: models like CUDA, OpenCL, and LEO+OpenMP on an accelerator or coprocessor, and higher-level models like MPI or a PGAS model across the cluster. Multiple programming models, their runtimes, and the varying performance of communication at different levels of the system hierarchy have kept applications from achieving peak performance on these systems.

For example, in MPI and OpenSHMEM applications running on GPU clusters, data is transferred between GPU and CPU using CUDA, while it is exchanged between MPI processes on different nodes using MPI or OpenSHMEM. This two-stage data movement introduces inefficiencies and limits performance. Communication in applications running on clusters with Intel MIC can happen over a myriad of channels depending on where the application processes run: intra-MIC, intra-host, MIC-host, and MIC-MIC. Each of these channels has different performance characteristics. Runtimes have to be redesigned to optimize communication in such scenarios while hiding system complexity from the user.

Computation-communication overlap has been a critical requirement for applications to achieve peak performance on large-scale systems. Communication overheads have a magnified impact on heterogeneous clusters because of their higher compute density and, hence, the greater amount of compute capacity left idle. Modern interconnects like InfiniBand, with their Remote Direct Memory Access (RDMA) capabilities, enable asynchronous progress of communication, freeing the cores to do useful computation. MPI and PGAS models offer lightweight, one-sided communication primitives that minimize process synchronization overheads and enable better computation-communication overlap. However, the design of one-sided communication on heterogeneous clusters is not well studied, and there is limited literature to guide scientists in taking advantage of one-sided communication semantics in high-end applications, more so on heterogeneous architectures.

This dissertation targets several of these challenges for programming on GPU and Intel MIC clusters. Our work on MVAPICH2-GPU enables MPI to be used in a unified manner for communication from both host and GPU device memory, taking advantage of the Unified Virtual Addressing (UVA) provided by CUDA. We propose designs in the MVAPICH2-GPU runtime that significantly improve the performance of inter-node and intra-node GPU-GPU communication by pipelining and overlapping memory, PCIe, and network transfers. We take advantage of CUDA features such as IPC, GPUDirect RDMA, and CUDA kernels to further reduce communication overheads. MVAPICH2-GPU improves programmability by removing the need for developers to mix CUDA and MPI for GPU-GPU communication, while improving performance through runtime-level optimizations that are transparent to the user. We have shown up to 69% and 45% improvements in point-to-point latency for 4-byte and 4 MB messages, respectively.
Likewise, the solutions improve bandwidth by 2x and 56% for 4 KB and 64 KB messages, respectively.
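
To make the difference between the two communication styles concrete, the following is a minimal sketch (not taken from the dissertation itself) in C with MPI and the CUDA runtime. The message size, ranks, and tags are illustrative assumptions; part (b) assumes a CUDA-aware MPI library such as MVAPICH2-GPU, which relies on UVA to detect that a buffer resides in device memory.

    /*
     * (a) stages data through host memory with explicit cudaMemcpy calls;
     * (b) passes the device pointer directly to MPI, letting a CUDA-aware
     *     runtime pipeline the PCIe and network transfers internally.
     */
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdlib.h>

    #define NBYTES   (4 * 1024 * 1024)            /* 4 MB payload (illustrative) */
    #define NDOUBLES (NBYTES / sizeof(double))

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double *d_buf;                            /* buffer in GPU device memory */
        cudaMalloc((void **)&d_buf, NBYTES);

        /* (a) Conventional two-stage movement: stage through host memory. */
        double *h_buf = malloc(NBYTES);
        if (rank == 0) {
            cudaMemcpy(h_buf, d_buf, NBYTES, cudaMemcpyDeviceToHost);
            MPI_Send(h_buf, NDOUBLES, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(h_buf, NDOUBLES, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            cudaMemcpy(d_buf, h_buf, NBYTES, cudaMemcpyHostToDevice);
        }

        /* (b) Unified style: hand the device pointer to MPI directly. */
        if (rank == 0) {
            MPI_Send(d_buf, NDOUBLES, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(d_buf, NDOUBLES, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        free(h_buf);
        cudaFree(d_buf);
        MPI_Finalize();
        return 0;
    }

The one-sided overlap pattern discussed above can be sketched in a similar, hypothetical way: the origin process issues MPI_Put and continues computing while an RDMA-capable network such as InfiniBand progresses the transfer asynchronously. The window size, target rank, and compute stub below are placeholders, not code from the dissertation.

    #include <mpi.h>

    static void compute_independent_work(void) { /* application computation */ }

    void put_with_overlap(double *win_buf, double *src, int count, int target)
    {
        MPI_Win win;
        MPI_Win_create(win_buf, count * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
        MPI_Put(src, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);

        compute_independent_work();   /* overlaps with the outstanding Put */

        MPI_Win_unlock(target, win);  /* completes the Put at the origin */
        MPI_Win_free(&win);
    }

How much overlap is actually realized in the second sketch depends on the runtime providing asynchronous progress, which is precisely the design space this dissertation explores for heterogeneous clusters.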
