A Framework for Memory Oversubscription Management in Graphics Processing Units

Modern discrete GPUs support unified memory and demand paging. Automatic management of data movement between CPU memory and GPU memory dramatically reduces developer effort. However, when an application's working set exceeds physical GPU memory capacity, the resulting data movement can cause severe performance loss. This paper proposes a memory management framework, called ETC, that transparently improves GPU performance under memory oversubscription using new techniques to overlap the eviction latency of GPU pages, reduce thrashing cost, and increase effective memory capacity. Eviction latency can be hidden by eagerly creating space for demand-paged data with proactive eviction (E). Thrashing costs can be ameliorated with memory-aware throttling (T), which dynamically reduces GPU parallelism when the page fault frequency becomes high. Capacity compression (C) can enable larger working sets without increasing physical memory capacity. No single technique fits all workloads, so ETC integrates proactive eviction, memory-aware throttling, and capacity compression into a principled framework that dynamically selects the most effective combination of techniques, transparently to the running software. To this end, ETC classifies applications into three categories: regular applications without data sharing across kernels, regular applications with data sharing across kernels, and irregular applications. Our evaluation shows that ETC fully mitigates the oversubscription overhead for regular applications without data sharing and delivers performance similar to the ideal unlimited GPU memory baseline. We also show that ETC outperforms the state-of-the-art baseline by 60.4% and 270% for regular applications with data sharing and irregular applications, respectively.
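The idea behind memory-aware throttling (T), reducing parallelism while page faults are frequent, can be illustrated with a small sketch. This is not ETC's actual policy: the epoch-based sampling, the fault threshold, the halving step, the one-unit recovery step, and the names `ThrottleController`, `end_of_epoch`, and `active_units` are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class ThrottleController:
    """Hypothetical epoch-based throttling controller (illustrative only)."""

    max_active_units: int      # assumed total number of GPU compute units
    fault_threshold: int       # assumed page-faults-per-epoch trigger
    min_active_units: int = 1  # never throttle below one active unit
    active_units: int = 0      # set to max_active_units on construction

    def __post_init__(self) -> None:
        self.active_units = self.max_active_units

    def end_of_epoch(self, page_faults: int) -> int:
        """Update and return the number of units allowed to run next epoch."""
        if page_faults > self.fault_threshold:
            # Fault frequency is high: halve parallelism to shrink the
            # concurrently accessed working set (assumed back-off rule).
            self.active_units = max(self.min_active_units,
                                    self.active_units // 2)
        else:
            # Fault frequency is low: restore parallelism one unit at a time.
            self.active_units = min(self.max_active_units,
                                    self.active_units + 1)
        return self.active_units


def run(controller: ThrottleController, faults_per_epoch: list[int]) -> list[int]:
    """Feed a trace of per-epoch fault counts through the controller."""
    return [controller.end_of_epoch(f) for f in faults_per_epoch]


if __name__ == "__main__":
    ctrl = ThrottleController(max_active_units=16, fault_threshold=100)
    print(run(ctrl, [500, 500, 10, 10, 500]))  # prints [8, 4, 5, 6, 3]
```

The asymmetric back-off (fast reduction, slow recovery) is one plausible design choice for avoiding oscillation between thrashing and underutilization; the abstract does not specify how ETC adjusts parallelism.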
