A Framework for Memory Oversubscription Management in Graphics Processing Units
Jun Yang | Chen Li | Yang Guo | Rachata Ausavarungnirun | Onur Mutlu | Youtao Zhang | Christopher J. Rossbach
[1] Laszlo A. Belady,et al. A Study of Replacement Algorithms for Virtual-Storage Computer , 1966, IBM Syst. J..
[2] David L. Black,et al. Translation lookaside buffer consistency: a software approach , 1989, ASPLOS III.
[3] Mark D. Hill,et al. Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.
[4] Per Stenström,et al. Recency-based TLB preloading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[5] William J. Dally,et al. Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).
[6] Anand Sivasubramaniam,et al. Going the distance for TLB prefetching: an application-driven study , 2002, ISCA.
[7] Per Stenström,et al. A Robust Main-Memory Compression Scheme , 2005, ISCA 2005.
[8] Gil Neiger,et al. Intel® Virtualization Technology for Directed I/O , 2006 .
[9] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[10] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[11] Erik Lindholm,et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture , 2008, IEEE Micro.
[12] Margaret Martonosi,et al. Characterizing the TLB Behavior of Emerging Parallel Workloads on Chip Multiprocessors , 2009, 2009 18th International Conference on Parallel Architectures and Compilation Techniques.
[13] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[14] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[15] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[16] John E. Stone,et al. OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems , 2010, Computing in Science & Engineering.
[17] Alan L. Cox,et al. Translation caching: skip, don't walk (the page table) , 2010, ISCA.
[18] Mahmut T. Kandemir,et al. Synergistic TLBs for High Performance Address Translation in Chip Multiprocessors , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[19] Alan L. Cox,et al. SpecTLB: A mechanism for speculative address translation , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[20] Margaret Martonosi,et al. Shared last-level TLBs for chip multiprocessors , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[21] David R. Kaeli,et al. Exploiting Memory Access Patterns to Improve Memory Performance in Data-Parallel Architectures , 2011, IEEE Transactions on Parallel and Distributed Systems.
[22] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[23] Jaehyuk Huh,et al. Revisiting hardware-assisted page walks for virtualized systems , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[24] Nam Sung Kim,et al. Lossless and lossy memory I/O link compression for improving performance of GPGPU workloads , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[25] Onur Mutlu,et al. Base-delta-immediate compression: Practical data compression for on-chip caches , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[26] Onur Mutlu,et al. Linearly compressed pages: A main memory compression framework with low complexity and low latency , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).
[27] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[28] Lifan Xu,et al. Auto-tuning a high-level language targeted to GPU codes , 2012, 2012 Innovative Parallel Computing (InPar).
[29] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[30] Aamer Jaleel,et al. CoLT: Coalesced Large-Reach TLBs , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[31] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[32] Michael M. Swift,et al. Efficient virtual memory for big memory servers , 2013, ISCA.
[33] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[34] Abhishek Bhattacharjee,et al. Large-reach memory management unit caches: Coalesced and shared memory management unit caches to accelerate TLB miss handling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[35] Margaret Martonosi,et al. TLB Improvements for Chip Multiprocessors: Inter-Core Cooperative Prefetchers and Shared Last-Level TLBs , 2013, TACO.
[36] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[37] Onur Mutlu,et al. Linearly compressed pages: A low-complexity, low-latency main memory compression framework , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[38] Yi Yang,et al. Warp-level divergence in GPUs: Characterization, impact, and mitigation , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[39] Gabriel H. Loh,et al. Increasing TLB reach by exploiting clustering in page translations , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[40] Rajeev Balasubramonian,et al. Managing DRAM Latency Divergence in Irregular GPGPU Applications , 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.
[41] Scott A. Mahlke,et al. VAST: The illusion of a large memory space for GPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[42] Abhishek Bhattacharjee,et al. Architectural support for address translation on GPUs: designing memory management units for CPU/GPUs with unified address spaces , 2014, ASPLOS.
[43] Mahmut T. Kandemir,et al. Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[44] Michael M. Swift,et al. Efficient Memory Virtualization: Reducing Dimensionality of Nested Page Walks , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[45] David A. Wood,et al. Supporting x86-64 address translation for 100s of GPU lanes , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[46] Jaehyuk Huh,et al. Fast Two-Level Address Translation for Virtualized Systems , 2015, IEEE Transactions on Computers.
[47] Mahmut T. Kandemir,et al. Anatomy of GPU Memory System for Multi-Application Execution , 2015, MEMSYS.
[48] Mahmut T. Kandemir,et al. A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[49] Onur Mutlu,et al. Simultaneous Multi Layer Access: A High Bandwidth and Low Cost 3D-Stacked Memory Interface , 2015, ArXiv.
[50] Onur Mutlu,et al. A case for core-assisted bottleneck acceleration in GPUs , 2015 .
[51] Shuaiwen Song,et al. Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.
[52] Stephen W. Keckler,et al. Page Placement Strategies for GPUs within Heterogeneous Memory Systems , 2015, ASPLOS.
[53] Frank Bellosa,et al. GPUswap: Enabling Oversubscription of GPU Memory through Transparent Swapping , 2015, VEE.
[54] Xin Tong,et al. Prediction-based superpage-friendly TLB designs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[55] Natalie D. Enright Jerger,et al. Interconnect-Memory Challenges for Multi-chip, Silicon Interposer Systems , 2015, MEMSYS.
[56] Osman S. Unsal,et al. Redundant Memory Mappings for fast access to large memories , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[57] Onur Mutlu,et al. Toggle-Aware Compression for GPUs , 2015, IEEE Computer Architecture Letters.
[58] Scott A. Mahlke,et al. Mascar: Speeding up GPU warps by reducing memory pitstops , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[59] Mattan Erez,et al. Bit-Plane Compression: Transforming Data for Better Compression in Many-Core Architectures , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[60] Onur Mutlu,et al. A case for toggle-aware compression for GPU systems , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[61] David W. Nellans,et al. Towards high performance paged memory for GPUs , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[62] Ján Veselý,et al. Observations and opportunities in architecting shared virtual memory for heterogeneous systems , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[63] Michael M. Swift,et al. Agile Paging: Exceeding the Best of Nested and Shadow Paging , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[64] Xiuhong Li,et al. Efficient kernel management on GPUs , 2016, DATE 2016.
[65] Onur Mutlu,et al. Simultaneous Multi-Layer Access , 2016, ACM Trans. Archit. Code Optim..
[66] Xiuhong Li,et al. Efficient kernel management on GPUs , 2016, 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE).
[67] Onur Mutlu,et al. Zorua: A holistic approach to resource virtualization in GPUs , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[68] Natalia Gimelshein,et al. vDNN: Virtualized deep neural networks for scalable, memory-efficient neural network design , 2016, 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[69] H. Reza Taheri,et al. Performance Implications of Extended Page Tables on Virtualized x86 Processors , 2016, VEE.
[70] Rachata Ausavarungnirun,et al. Mosaic: A GPU Memory Manager with Application-Transparent Support for Multiple Page Sizes , 2017, 2017 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[71] David A. Patterson,et al. In-datacenter performance analysis of a tensor processing unit , 2017, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA).
[72] Chen Meng,et al. Training Deeper Models by GPU Memory Optimization on TensorFlow , 2017 .
[73] Jason Cong,et al. Supporting Address Translation for Accelerator-Centric Architectures , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[74] Stephen W. Keckler,et al. Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks , 2017, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[75] AngryCalc. NVIDIA GeForce GTX 1050 Ti , 2018 .
[76] Rachata Ausavarungnirun,et al. MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency , 2018, ASPLOS.
[77] Yan Solihin,et al. Scheduling Page Table Walks for Irregular GPU Applications , 2018, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA).
[78] Minsoo Rhu,et al. Beyond the Memory Wall: A Case for Memory-Centric HPC System for Deep Learning , 2018, 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[79] Minsoo Rhu,et al. A Case for Memory-Centric HPC System Architecture for Training Deep Neural Networks , 2018, IEEE Computer Architecture Letters.
[80] Mohamed Ibrahim,et al. Efficient and Fair Multi-programming in GPUs via Effective Bandwidth Management , 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[81] Abhinav Vishnu,et al. Scaling Deep Learning workloads: NVIDIA DGX-1/Pascal and Intel Knights Landing , 2020, Future Gener. Comput. Syst..