HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters

An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, even more so when the data is non-contiguous, as is common in scientific applications. Existing techniques for optimizing MPI datatype processing to improve non-contiguous data movement handle only certain data patterns efficiently and incur overheads for the others. In this paper, we first propose a set of optimized techniques to handle different MPI datatypes. Next, we propose a novel framework (HAND) that enables hybrid and adaptive selection among these techniques, along with tuning, to achieve better performance across all datatypes. Our experimental results using the modified DDTBench suite demonstrate up to a 98% reduction in datatype processing latency. We also apply this datatype-aware design to an N-Body particle simulation application. Performance evaluation of this application on a 64-GPU cluster shows that our proposed approach achieves up to 80% and 54% performance improvement with struct and indexed datatypes, respectively, compared to the existing best design. To the best of our knowledge, this is the first attempt to propose a hybrid and adaptive solution that integrates all existing schemes to optimize arbitrary non-contiguous data movement using MPI datatypes on GPU clusters.

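To illustrate the kind of non-contiguous transfer the paper targets, the following is a minimal sketch of sending a strided, GPU-resident matrix column described by an MPI derived datatype. It assumes a CUDA-aware MPI library (e.g., MVAPICH2-GPU) that accepts device pointers directly; the matrix dimensions and variable names are illustrative only, and the sketch does not reflect HAND's internal processing.

```c
/* Minimal sketch: sending a strided (non-contiguous) GPU-resident column
 * with an MPI vector datatype. Assumes a CUDA-aware MPI build
 * (e.g., MVAPICH2-GPU) that accepts device pointers directly. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int rows = 1024, cols = 1024;     /* illustrative sizes */
    double *d_matrix;                       /* row-major matrix on the GPU */
    cudaMalloc((void **)&d_matrix, rows * cols * sizeof(double));

    /* One element per row, separated by a stride of 'cols' elements:
     * this describes one matrix column without packing it first. */
    MPI_Datatype column;
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(d_matrix, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_matrix, 1, column, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_matrix);
    MPI_Finalize();
    return 0;
}
```

How such a datatype is unpacked and moved on the device (packing kernels, pipelining, or direct RDMA) is exactly the processing-path choice that the proposed hybrid framework selects among at runtime.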