Characterizing CUDA Unified Memory (UM)-Aware MPI Designs on Modern GPU Architectures

The CUDA Unified Memory (UM) interface enables a significantly simpler programming paradigm and has the potential to fundamentally change the way programmers write CUDA applications. Although UM improves productivity by presenting the programmer with a single view of CPU and GPU memory, initial UM support on the Kepler series of GPUs performed poorly, necessitating several UM-aware designs in state-of-the-art MPI runtimes such as MVAPICH2-GDR. These designs have enabled end MPI applications to take advantage of the high productivity promised by UM along with high performance. However, as CUDA runtimes and GPU architectures have advanced, the performance offered by UM has also improved significantly. There is therefore a need to re-evaluate the performance characteristics of UM in light of these changes and to understand how the UM-aware designs in state-of-the-art MPI runtimes must be adapted. We take up this broad challenge and characterize the performance of UM-aware MPI operations to gain insight into how MPI runtimes should handle UM-based data residing on the GPU and the CPU across different generations of GPU architectures. Our characterization studies show that the UM designs conceived during the Kepler GPU era remain valid and provide valuable performance improvements on the latest Pascal and Volta GPUs. Furthermore, evaluation of the optimized UM designs shows that they outperform naive designs in MVAPICH2-GDR and Open MPI by 4.2x and 2.8x, respectively, on Intel systems. Additionally, the device-to-device (DD) experiments for pure device transfers show that MVAPICH2-GDR is up to 12.6x better than Open MPI (with UCX).
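To make the communication pattern being characterized concrete, the following is a minimal sketch (not taken from the paper) of how a UM buffer allocated with cudaMallocManaged can be handed directly to MPI point-to-point calls. It assumes a CUDA-aware MPI build such as MVAPICH2-GDR or Open MPI with UCX; the kernel name, buffer size, and launch configuration are illustrative.

```cuda
// Minimal sketch: a single cudaMallocManaged allocation is used both by a GPU
// kernel and as the buffer passed to MPI_Send/MPI_Recv. The UM driver migrates
// pages on demand, so the same pointer is valid on the CPU and the GPU.
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void fill(float *buf, int n, float val) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = val;
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                      // illustrative message size
    float *buf;
    cudaMallocManaged((void **)&buf, n * sizeof(float), cudaMemAttachGlobal);

    if (rank == 0) {
        // Data is produced on the GPU, so the pages are device-resident when
        // the (UM-aware) MPI runtime picks up the buffer.
        fill<<<(n + 255) / 256, 256>>>(buf, n, 1.0f);
        cudaDeviceSynchronize();
        MPI_Send(buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: buf[0] = %f\n", buf[0]); // touched from the CPU side
    }

    cudaFree(buf);
    MPI_Finalize();
    return 0;
}
```

How efficiently a runtime services such a transfer depends on whether it can detect that the buffer is managed and where its pages currently reside, which is exactly the behavior the characterization study examines across Kepler, Pascal, and Volta GPUs.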
