Increasing GPU Translation Reach by Leveraging Under-Utilized On-Chip Resources
Mahmut T. Kandemir | Gabriel H. Loh | Michael LeBeane | Jagadish B. Kotra