Efficient Data Communication between CPU and GPU through Transparent Partial-Page Migration

Despite increasing investment in integrated GPUs and research on next-generation interconnects, discrete GPUs connected over PCI Express still dominate the market, and the management of data communication between CPU and GPU continues to evolve. Initially, programmers controlled data transfers between CPU and GPU explicitly. To simplify programming and enable system-wide atomic memory operations, GPU vendors have since developed a programming model that provides a single virtual address space, in which a page migration engine automatically migrates pages between CPU and GPU on demand. To meet the needs of high-performance workloads, page sizes have trended larger; over low-bandwidth, high-latency interconnects, migrating these larger pages takes longer, which reduces the overlap of computation and communication and can cause severe performance degradation. In this paper, we propose partial-page migration, which migrates only the requested portion of a page, shortening migration latency and avoiding the performance loss that whole-page migration suffers as pages grow. Experiments show that partial-page migration can largely hide the overhead of whole-page migration: with a 2MB page size and 16GB/sec of PCI Express bandwidth, it converts an average 72.72× slowdown into a 1.29× speedup compared with programmer-controlled data transfers. Additionally, we examine the impact of page size on TLB miss rate and of migration unit size on execution time, enabling designers to make informed decisions.
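
The contrast the abstract draws between explicit transfers and the single-virtual-address-space model can be sketched with the CUDA runtime API. The kernel and buffer size below are illustrative choices, not taken from the paper; the two paths use only standard CUDA calls (cudaMemcpy versus cudaMallocManaged with on-demand page migration).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to force the GPU to touch the data.
__global__ void inc(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // (a) Explicit model: the programmer stages every copy over PCIe.
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h[i] = 0.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU
    inc<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
    cudaFree(d);
    free(h);

    // (b) Single virtual address space: one pointer is valid on both sides,
    // and the driver's page migration engine moves pages on demand when
    // either processor faults on them; no explicit cudaMemcpy is needed.
    float *u;
    cudaMallocManaged(&u, bytes);
    for (int i = 0; i < n; ++i) u[i] = 0.0f;   // pages resident on the CPU
    inc<<<(n + 255) / 256, 256>>>(u, n);       // faults migrate pages to the GPU
    cudaDeviceSynchronize();
    printf("u[0] = %f\n", u[0]);               // access migrates pages back
    cudaFree(u);
    return 0;
}
```

The proposal itself can be modeled as bookkeeping layered on path (b). The following host-side sketch is a conceptual illustration of partial-page migration under assumed parameters (a 2MB page split into hypothetical 4KB migration units; names such as PageState and handle_gpu_fault are invented here); it is not the paper's implementation.

```cuda
#include <cstddef>
#include <cstdint>
#include <cuda_runtime.h>

constexpr size_t kPageSize = 2 * 1024 * 1024;        // 2MB large page
constexpr size_t kUnitSize = 4 * 1024;               // assumed migration unit
constexpr size_t kUnits    = kPageSize / kUnitSize;  // 512 units per page

// Per-page bookkeeping: which migration units are already on the GPU.
struct PageState {
    const uint8_t *cpu_base;         // backing copy in host memory
    uint8_t       *gpu_base;         // frame in device memory
    uint64_t resident[kUnits / 64];  // one bit per unit; 1 = on the GPU
};

// Handle a GPU fault at byte offset `off` within the page. Whole-page
// migration would copy kPageSize bytes here; partial-page migration copies
// only the faulting unit, roughly kPageSize / kUnitSize (512x) less data
// per fault. Returns the number of bytes moved over the interconnect.
size_t handle_gpu_fault(PageState &pg, size_t off) {
    size_t unit = off / kUnitSize;
    uint64_t mask = 1ull << (unit % 64);
    if (pg.resident[unit / 64] & mask)
        return 0;                                    // unit already resident
    cudaMemcpy(pg.gpu_base + unit * kUnitSize,
               pg.cpu_base + unit * kUnitSize,
               kUnitSize, cudaMemcpyHostToDevice);
    pg.resident[unit / 64] |= mask;                  // mark the unit resident
    return kUnitSize;
}
```

In this sketch, the residency bitmap keeps the per-fault transfer proportional to one migration unit rather than the whole page, which is what preserves the overlap between computation and transmission that the abstract describes; subsequent faults pull in the remaining units as they are actually touched.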
