Interplay between Hardware Prefetcher and Page Eviction Policy in CPU-GPU Unified Virtual Memory

Memory capacity in GPGPUs is a major challenge for data-intensive applications with their ever increasing memory requirement. To fit a workload into the limited GPU memory space, a programmer needs to manually divide the workload by tiling the working set and perform user-level data migration. To relieve the programmer from this burden, Unified Virtual Memory (UVM) was developed to support on-demand paging and migration, transparent to the user. It further takes care of the memory over-subscription issue by automatically performing page replacement in an oversubscribed GPU memory situation. However, we found that naïve handling of page faults can cause orders of magnitude slowdown in performance. Moreover, we observed that although prefetching of data from CPU to GPU can hide the page fault latency, the difference among various prefetching mechanisms can lead to drastically different performance results. To this end, we performed extensive experiments on GeForceGTX 1080ti GPUs with PCI-e 3.0 16x to discover that there exists an effective prefetch mechanism to enhance locality in GPU memory. However, as the GPU memory is filled to its capacity, such prefetching mechanism quickly proves to be counterproductive due to locality unaware eviction policy. This necessitates the design of new eviction policies that are aware of the hardware prefetcher semantics. We propose two new programmer-agnostic, locality-aware pre-eviction policies which leverage the mechanics of existing hardware prefetcher and thus incur no additional implementation and performance overhead. We demonstrate that combining the proposed tree-based pre-eviction policy with the hardware prefetcher provides an average of 93% and 18.5% performance speed-up compared to LRU based 4KB and 2MB page replacement strategies, respectively. We further examine the memory access pattern of GPU workloads under consideration to analyze the achieved performance speed-up.

[1]  Rachata Ausavarungnirun,et al.  Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[2]  Michael M. Swift,et al.  Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[3]  Mahmut T. Kandemir,et al.  Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.

[4]  AngryCalc NVIDIA GeForce GTX 1050 Ti , 2018 .

[5]  Frank Bellosa,et al.  GPUswap: Enabling Oversubscription of GPU Memory through Transparent Swapping , 2015, VEE.

[6]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[7]  A. Azzouz 2011 , 2020, City.

[8]  S. M. García,et al.  2014: , 2020, A Party for Lazarus.

[9]  Scott A. Mahlke,et al.  VAST: The illusion of a large memory space for GPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[10]  Wen-mei W. Hwu,et al.  Run-time spatial locality detection and optimization , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.

[11]  Lifan Xu,et al.  Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).

[12]  Stephen W. Keckler,et al.  Page Placement Strategies for GPUs within Heterogeneous Memory Systems , 2015, ASPLOS.

[13]  David W. Nellans,et al.  Towards high performance paged memory for GPUs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[14]  Yan Solihin,et al.  Scheduling Page Table Walks for Irregular GPU Applications , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).

[15]  Henry Wong,et al.  Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[16]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[17]  Margaret Martonosi,et al.  Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.

[18]  Jun Yang,et al.  A Framework for Memory Oversubscription Management in Graphics Processing Units , 2019, ASPLOS.

[19]  Hai Jin,et al.  Hotplug or Ballooning: A Comparative Study on Dynamic Memory Management Techniques for Virtual Machines , 2015, IEEE Transactions on Parallel and Distributed Systems.

[20]  Abhishek Bhattacharjee,et al.  Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.

[21]  Jason Cong,et al.  Supporting Address Translation for Accelerator-Centric Architectures , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).