OMB-UM: Design, Implementation, and Evaluation of CUDA Unified Memory Aware MPI Benchmarks

Unified Memory (UM) has significantly simplified the task of programming CUDA applications. With UM, the CUDA driver manages data movement between the CPU and GPU, and the programmer can focus on the actual application design. However, the performance of Unified Memory codes has not been on par with that of explicit device-buffer-based codes. To this end, the latest NVIDIA Pascal and Volta GPUs, with hardware support such as fine-grained page faults, offer the best of both worlds: high productivity and high performance. However, these enhancements in the newer GPU architectures need to be evaluated differently, especially in the context of MPI+CUDA applications. In this paper, we extend the widely used OSU Micro-Benchmarks (OMB) suite to support Unified Memory (Managed Memory) based MPI benchmarks. The current version of OMB cannot effectively characterize UM-aware MPI designs because the data movements performed by the CUDA driver are not captured appropriately by the standard host- and device-buffer-based benchmarks. To address this key challenge, we propose new designs for the OMB suite and extend the point-to-point and collective benchmarks with sender- and receiver-side CUDA kernels that emulate the effective location of the UM buffer on the host or the device. The new benchmarks allow users to better understand the performance of codes with UM buffers through user-selectable knobs that enable or disable the sender- and receiver-side CUDA kernels. In addition to the design and implementation, we provide a comprehensive performance evaluation of the new UM benchmarks in the OMB-UM suite on a wide variety of systems and MPI libraries. From these evaluations, we also derive valuable insights into the performance of various MPI libraries with UM buffers, which can lead to further improvements in the performance of UM in CUDA-aware MPI libraries.
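The sender-side kernel idea described above can be sketched as follows. This is a minimal, hypothetical illustration, not code from OMB-UM: the kernel and helper names (`touch_buffer`, `um_send`, `enable_kernel`) are invented for this sketch. The point is that touching a managed buffer from a CUDA kernel causes the driver to fault its pages onto the GPU, so enabling or disabling that kernel before a timed `MPI_Send` emulates a UM buffer whose effective location is the device or the host, respectively.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Trivial kernel that touches every byte of the managed buffer, so the
// CUDA driver migrates its pages to the GPU before the MPI call is timed.
__global__ void touch_buffer(char *buf, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        buf[i]++;
}

// Hypothetical sender-side helper: 'enable_kernel' plays the role of the
// user-selectable knob described in the paper (names are illustrative).
// 'buf' is assumed to have been allocated with cudaMallocManaged().
void um_send(char *buf, size_t n, int dst, int enable_kernel)
{
    if (enable_kernel) {
        // Effective location = Device: fault the managed pages onto the GPU.
        touch_buffer<<<(unsigned)((n + 255) / 256), 256>>>(buf, n);
        cudaDeviceSynchronize();
    }
    // With the kernel disabled, the managed pages stay CPU-resident and the
    // MPI library sees a UM buffer whose effective location is Host.
    MPI_Send(buf, (int)n, MPI_CHAR, dst, 0, MPI_COMM_WORLD);
}
```

A CUDA-aware MPI library can accept the managed pointer directly in `MPI_Send`; a mirrored receiver-side kernel (touching the buffer after `MPI_Recv`) would emulate the receive-side effective location in the same way.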
