Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources

Many GPU applications issue irregular memory accesses across a very large memory footprint. We confirm observations from prior work that these irregular access patterns are severely bottlenecked by insufficient Translation Lookaside Buffer (TLB) reach, which results in expensive page table walks. In this work, we investigate mechanisms that improve TLB reach without increasing either the page size or the size of the TLB itself. Our work builds on the observation that a GPU’s instruction cache (I-cache) and Local Data Share (LDS) scratchpad memory are under-utilized in many applications, including those that suffer from poor TLB reach. We exploit this by opportunistically using idle capacity and port bandwidth in the GPU’s I-cache and LDS structures to hold address translations. We explore several architectural designs for each structure, aiming to optimize performance while minimizing complexity. In each design, the structure is organized as a victim cache between the L1 and L2 TLBs to boost translation reach. We find that our designs improve performance by 30.1% on average without degrading the performance of applications that do not require additional reach.
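The victim-cache placement described above can be illustrated with a small functional model. The following is a minimal sketch under stated assumptions, not the design evaluated in this work: the class names (`LRUTable`, `TranslationPath`), the fully associative LRU organization, the entry counts, and the choice to move a victim-cache hit back into the L1 TLB are all hypothetical, chosen only to show how a victim structure between the L1 and L2 TLBs can avoid page table walks. It does not model the I-cache or LDS hardware, port bandwidth, or timing.

```python
from collections import OrderedDict


class LRUTable:
    """Fully associative translation table with LRU replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # virtual page number -> physical frame number

    def lookup(self, vpn):
        """Return the frame for vpn and mark it most recently used, or None on a miss."""
        if vpn in self.entries:
            self.entries.move_to_end(vpn)
            return self.entries[vpn]
        return None

    def remove(self, vpn):
        """Remove and return the frame for vpn, or None if absent."""
        return self.entries.pop(vpn, None)

    def insert(self, vpn, pfn):
        """Insert a translation; return the evicted (vpn, pfn) pair, or None."""
        if vpn in self.entries:
            self.entries[vpn] = pfn
            self.entries.move_to_end(vpn)
            return None
        victim = None
        if len(self.entries) >= self.capacity:
            victim = self.entries.popitem(last=False)  # evict least recently used
        self.entries[vpn] = pfn
        return victim


class TranslationPath:
    """L1 TLB -> victim structure -> L2 TLB -> page table walk (hypothetical model)."""

    def __init__(self, l1_entries, victim_entries, l2_entries, page_table):
        self.l1 = LRUTable(l1_entries)
        self.victim = LRUTable(victim_entries)  # stands in for borrowed on-chip capacity
        self.l2 = LRUTable(l2_entries)
        self.page_table = page_table            # dict: vpn -> pfn
        self.stats = {"l1": 0, "victim": 0, "l2": 0, "walk": 0}

    def _fill_l1(self, vpn, pfn):
        # Entries evicted from the L1 TLB are kept in the victim structure.
        evicted = self.l1.insert(vpn, pfn)
        if evicted is not None:
            self.victim.insert(*evicted)

    def translate(self, vpn):
        pfn = self.l1.lookup(vpn)
        if pfn is not None:
            self.stats["l1"] += 1
            return pfn
        pfn = self.victim.remove(vpn)  # assumed policy: a hit moves back to L1
        if pfn is not None:
            self.stats["victim"] += 1
        else:
            pfn = self.l2.lookup(vpn)
            if pfn is not None:
                self.stats["l2"] += 1
            else:
                self.stats["walk"] += 1  # the expensive case the design tries to avoid
                pfn = self.page_table[vpn]
                self.l2.insert(vpn, pfn)
        self._fill_l1(vpn, pfn)
        return pfn
```

With a deliberately tiny configuration such as `TranslationPath(2, 4, 2, page_table)`, touching four distinct pages and then returning to the first one is served by the victim structure rather than by another page table walk, which is the reach benefit the abstract refers to.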
