GPUdmm: A high-performance and memory-oblivious GPU architecture using dynamic memory management

GPU programmers suffer under programmer-managed GPU memory because both performance and programmability depend heavily on GPU memory allocation and CPU-GPU data transfer mechanisms. To improve both, programmers should be able to place only the data frequently accessed by the GPU in GPU memory, while overlapping CPU-GPU data transfers with GPU execution as much as possible. However, current GPU architectures and programming models blindly place all data in GPU memory, requiring a significantly large GPU memory; when GPU memory is too small, they instead trigger unnecessary CPU-GPU data transfers. In this paper, we propose GPUdmm, a novel GPU architecture that enables high-performance, memory-oblivious GPU programming. First, GPUdmm uses GPU memory as a cache of CPU memory, giving programmers a view of a CPU-memory-sized programming space. Second, GPUdmm achieves high performance by exploiting data locality and dynamically transferring data between CPU and GPU memories, while effectively overlapping CPU-GPU data transfers with GPU execution. Third, GPUdmm further reduces unnecessary CPU-GPU data transfers by exploiting simple programmer-supplied hints. Our carefully designed and validated experiments (e.g., PCIe/DMA timing) on representative benchmarks show that GPUdmm achieves up to five times higher performance for the same GPU memory size, or reduces the GPU memory size requirement by up to 75% while maintaining the same performance.
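To make the programming-model contrast concrete, the sketch below shows the conventional, explicitly managed CUDA workflow that GPUdmm targets: the programmer must size the device allocation to fit physical GPU memory and stage every transfer by hand. The host and kernel code use only standard CUDA runtime calls; the comments marking what GPUdmm would eliminate reflect a reading of the abstract, not an API defined by the paper.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simple kernel: scale every element of the array.
__global__ void scale(float *data, float factor, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const size_t n = 1 << 26;              // 64M floats = 256 MB working set
    const size_t bytes = n * sizeof(float);

    float *h_data = (float *)malloc(bytes);
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Conventional CUDA: the allocation must fit in physical GPU memory.
    // Under GPUdmm, no separate device allocation would be needed; the
    // kernel would address the CPU-memory-sized space directly, with GPU
    // memory acting as a hardware-managed cache of it.
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, bytes) != cudaSuccess) {
        fprintf(stderr, "GPU memory too small for the whole data set\n");
        return 1;
    }

    // Conventional CUDA: the entire array is copied up front, whether or
    // not the kernel touches all of it. GPUdmm instead transfers data on
    // demand, overlapping transfers with kernel execution.
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

    scale<<<(unsigned)((n + 255) / 256), 256>>>(d_data, 2.0f, n);

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);

    printf("h_data[0] = %f (expected 2.0)\n", h_data[0]);
    free(h_data);
    return 0;
}
```

The third mechanism named in the abstract, programmer hints, would sit on top of this single-address-space view: for instance, a hypothetical call marking a region read-only could tell the hardware that its cached copy never needs to be written back to CPU memory. The abstract names the mechanism but not its interface, so any such signature is illustrative only. For comparison, stock CUDA's unified memory (cudaMallocManaged) offers a similar CPU-memory-sized view managed in software; GPUdmm's distinction is providing that view with hardware-managed caching and transfer/execution overlap.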
