In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated Computing

The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for the ease of use of a system-managed memory space with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is presently the primary real-world implementation of such an abstraction and offers a functionally equivalent testbed for a novel in-depth performance study of both UVM and future systems compatible with Linux Heterogeneous Memory Management (HMM). The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on a UVM-based system and investigate the root causes of the UVM overhead, a non-trivial task due to the complex interactions among multiple hardware and software components and the need for a targeted analysis methodology. In this paper, we take a deep dive into the UVM system architecture and the internal behaviors of page fault generation and servicing. Using targeted benchmarks, we reveal specific GPU hardware limitations and uncover how the driver functions as a real-time system when processing the resultant workload. We further provide a quantitative evaluation of fault handling for various applications under different scenarios, including prefetching and oversubscription. We find that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. We determine that the cost of host OS components is significant and present across implementations, warranting close attention. This study serves as a proxy for future shared memory systems, such as those that interface with HMM.
