论文信息 - Improving Multi-Application Concurrency Support Within the GPU Memory System

Improving Multi-Application Concurrency Support Within the GPU Memory System

GPUs exploit a high degree of thread-level parallelism to hide long-latency stalls. Due to the heterogeneous compute requirements of different applications, there is a growing need to share the GPU across multiple applications in large-scale computing environments. However, while CPUs offer relatively seamless multi-application concurrency, and are an excellent fit for multitasking and for virtualized environments, GPUs currently offer only primitive support for multi-application concurrency. Much of the problem in a contemporary GPU lies within the memory system, where multi-application execution requires virtual memory support to manage the address spaces of each application and to provide memory protection. In this work, we perform a detailed analysis of the major problems in state-of-the-art GPU virtual memory management that hinders multi-application execution. Existing GPUs are designed to share memory between the CPU and GPU, but do not handle multi-application support within the GPU well. We find that when multiple applications spatially share the GPU, there is a significant amount of inter-core thrashing on the shared TLB within the GPU. The TLB contention is high enough to prevent the GPU from successfully hiding stall latencies, thus becoming a first-order performance concern. We introduce MASK, a memory hierarchy design that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK extends the GPU memory hierarchy to efficiently support address translation through the use of multi-level TLBs, and uses translation-aware memory and cache management to maximize throughput in the presence of inter-application contention.

[1] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.

[2] Alan L. Cox,et al. SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[3] Onur Mutlu,et al. Coordinated control of multiple prefetchers in multi-core systems , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[4] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[5] Shuaiwen Song,et al. Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.

[6] Keshav Pingali,et al. A quantitative study of irregular programs on GPUs , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[7] Carole-Jean Wu,et al. MCM-GPU: Multi-chip-module GPUs for continued performance scalability , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).

[8] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .

[9] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[10] Mattan Erez,et al. A QoS-aware memory controller for dynamically balancing GPU and CPU bandwidth use in an MPSoC , 2012, DAC Design Automation Conference 2012.

[11] Scott A. Mahlke,et al. VAST: The illusion of a large memory space for GPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[12] Rami G. Melhem,et al. Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[13] Martín Abadi,et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[14] Xuhao Chen,et al. Adaptive Cache Management for Energy-Efficient GPU Computing , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[15] Shinpei Kato,et al. Gdev: First-Class GPU Resource Management in the Operating System , 2012, USENIX Annual Technical Conference.

[16] Aamer Jaleel,et al. High performance cache replacement using re-reference interval prediction (RRIP) , 2010, ISCA.

[17] Yaozu Dong,et al. A Full GPU Virtualization Solution with Mediated Pass-Through , 2014, USENIX Annual Technical Conference.

[18] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.

[19] Mark Silberstein,et al. PTask: operating system abstractions to manage GPUs as compute devices , 2011, SOSP.

[20] Federico Silla,et al. rCUDA: Reducing the number of GPU-based accelerators in high performance clusters , 2010, 2010 International Conference on High Performance Computing & Simulation.

[21] Shinpei Kato,et al. GPUvm: Why Not Virtualizing GPUs at the Hypervisor? , 2014, USENIX Annual Technical Conference.

[22] David L. Black,et al. Translation lookaside buffer consistency: a software approach , 1989, ASPLOS III.

[23] Tor M. Aamodt,et al. Complexity effective memory access scheduling for many-core accelerator architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[24] Gil Neiger,et al. Intel ® Virtualization Technology for Directed I/O , 2006 .

[25] Jason Cong,et al. Supporting Address Translation for Accelerator-Centric Architectures , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[26] Michael M. Swift,et al. Efficient virtual memory for big memory servers , 2013, ISCA.

[27] Kevin Kai-Wei Chang,et al. SQUASH: Simple QoS-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2015, ArXiv.

[28] Rudolf Eigenmann,et al. Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks , 2017, PPOPP.

[29] Karthikeyan Sankaralingam,et al. iGPU: Exception support and speculative execution on GPUs , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[30] Yu Wang,et al. Coordinated static and dynamic cache bypassing for GPUs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[31] Mateo Valero,et al. Enabling preemptive multiprogramming on GPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[32] Abhishek Bhattacharjee,et al. Large-reach memory management unit caches , 2013, MICRO.

[33] Naga K. Govindaraju,et al. Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[34] Hyesoon Kim,et al. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[35] J. E. Thornton,et al. Parallel operation in the control data 6600 , 1964, AFIPS '64 (Fall, part II).

[36] Margaret Martonosi,et al. Inter-core cooperative TLB for chip multiprocessors , 2010, ASPLOS XV.

[37] David W. Nellans,et al. Towards high performance paged memory for GPUs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[38] Michael J. Flynn,et al. Very high-speed computing systems , 1966 .

[39] Ian Karlin,et al. LULESH 2.0 Updates and Changes , 2013 .

[40] Collin McCurdy,et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.

[41] William J. Dally,et al. Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[42] Onur Mutlu,et al. The evicted-address filter: A unified mechanism to address both cache pollution and thrashing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[43] Xuhao Chen,et al. Adaptive Cache Bypass and Insertion for Many-core Accelerators , 2014, MES '14.

[44] Mahmut T. Kandemir,et al. Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.

[45] Won Woo Ro,et al. Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).

[46] Donald S. Fussell,et al. Priority-based cache allocation in throughput processors , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[47] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[48] Aamer Jaleel,et al. Adaptive insertion policies for high performance caching , 2007, ISCA '07.

[49] Aamer Jaleel,et al. Adaptive insertion policies for managing shared caches , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[50] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[51] AngryCalc. NVIDIA GeForce GTX 1050 Ti , 2018 .

[52] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[53] Martin Schulz,et al. Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.

[54] Osman S. Unsal,et al. Redundant Memory Mappings for fast access to large memories , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[55] Mahmut T. Kandemir,et al. Anatomy of GPU Memory System for Multi-Application Execution , 2015, MEMSYS.

[56] Vikram K. Narayana,et al. Symbiotic scheduling of concurrent GPU kernels for performance and energy optimizations , 2014, Conf. Computing Frontiers.

[57] Daniel J. Sorin,et al. UNified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all , 2010, HPCA - 16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture.

[58] Stijn Eyerman,et al. System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[59] Kevin Kai-Wei Chang,et al. DASH: Deadline-Aware High-Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators , 2016, ACM Trans. Archit. Code Optim..

[60] Rajeev Balasubramonian,et al. Managing DRAM Latency Divergence in Irregular GPGPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[61] Ben Sander,et al. Applying AMD's Kaveri APU for heterogeneous computing , 2014, 2014 IEEE Hot Chips 26 Symposium (HCS).

[62] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[63] Lan Vu,et al. GPU virtualization for high performance general purpose computing on the ESX hypervisor , 2014, SpringSim.

[64] Yun Liang,et al. An efficient compiler framework for cache bypassing on GPUs , 2013, ICCAD 2013.

[65] Paul F. Reynolds,et al. Parallel Operations , 1989 .

[66] Mahmut T. Kandemir,et al. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance , 2015, 2015 International Conference on Parallel Architecture and Compilation (PACT).

[67] Thomas F. Wenisch,et al. Unlocking bandwidth for GPUs in CC-NUMA systems , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[68] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[69] B J Smith,et al. A pipelined, shared resource MIMD computer , 1986 .

[70] Matthew Poremba,et al. Design and Analysis of an APU for Exascale Computing , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[71] Ján Veselý,et al. Observations and opportunities in architecting shared virtual memory for heterogeneous systems , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[72] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[73] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[74] Andreas Moshovos,et al. Demystifying GPU microarchitecture through microbenchmarking , 2010, 2010 IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS).

[75] Stijn Eyerman,et al. Restating the Case for Weighted-IPC Metrics to Evaluate Multiprogram Workload Performance , 2014, IEEE Computer Architecture Letters.

[76] R. Govindarajan,et al. Improving GPGPU concurrency with elastic kernels , 2013, ASPLOS '13.

[77] Kevin Kai-Wei Chang,et al. Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[78] Margaret Martonosi,et al. MRPB: Memory request prioritization for massively parallel processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).

[79] Muli Ben-Yehuda,et al. IOMMU: strategies for mitigating the IOTLB bottleneck , 2010, ISCA'10.

[80] Nam Sung Kim,et al. The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[81] Abhishek Bhattacharjee,et al. Efficient Address Translation for Architectures with Multiple Page Sizes , 2017, ASPLOS.