Memory-aware TLP throttling and cache bypassing for GPUs
Hai Tan, Yanxiang He, Jun Zhang, Qing'an Li, Fanfan Shen
[1] Jianlong Zhong, et al. Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling, 2013, IEEE Transactions on Parallel and Distributed Systems.
[2] Mike O'Connor, et al. Divergence-Aware Warp Scheduling, 2013, 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[3] William J. Dally, et al. The GPU Computing Era, 2010, IEEE Micro.
[4] Yi Yang, et al. Warp-level divergence in GPUs: Characterization, impact, and mitigation, 2014, IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[5] Wen-mei W. Hwu, et al. Run-time Adaptive Cache Hierarchy Via Reference Analysis, 1997, ISCA.
[6] Scott A. Mahlke, et al. Dynamic Resource Management for Efficient Utilization of Multitasking GPUs, 2017, ASPLOS.
[7] Won Woo Ro, et al. Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming, 2016, ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[8] Xuhao Chen, et al. Adaptive Cache Management for Energy-Efficient GPU Computing, 2014, 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[9] Erik Lindholm, et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture, 2008, IEEE Micro.
[10] Wen-mei W. Hwu, et al. Run-time Adaptive Cache Hierarchy Via Reference Analysis, 1997, 24th Annual International Symposium on Computer Architecture (ISCA).
[11] Dong Li. Orchestrating Thread Scheduling and Cache Management to Improve Memory System Throughput in Throughput Processors, 2014.
[12] Greg Humphreys, et al. How GPUs Work, 2007, Computer.
[13] Rami G. Melhem, et al. Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing, 2016, IEEE International Symposium on High Performance Computer Architecture (HPCA).
[14] Mikko H. Lipasti, et al. Adaptive Cache and Concurrency Allocation on GPGPUs, 2015, IEEE Computer Architecture Letters.
[15] Mainak Chaudhuri, et al. Bypass and insertion algorithms for exclusive last-level caches, 2011, 38th Annual International Symposium on Computer Architecture (ISCA).
[16] Yi Yang, et al. Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs, 2014, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[17] Kevin Skadron, et al. Rodinia: A benchmark suite for heterogeneous computing, 2009, IEEE International Symposium on Workload Characterization (IISWC).
[18] Won Woo Ro, et al. Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit, 2016, ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[19] Mateo Valero, et al. Improving Cache Management Policies Using Dynamic Reuse Distances, 2012, 45th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[20] Yu Wang, et al. Coordinated static and dynamic cache bypassing for GPUs, 2015, IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[21] Mateo Valero, et al. Enabling preemptive multiprogramming on GPUs, 2014, ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[22] Shuaiwen Song, et al. Locality-Driven Dynamic GPU Cache Bypassing, 2015, ICS.
[23] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing, 2009.
[24] Carole-Jean Wu, et al. Ctrl-C: Instruction-Aware Control Loop Based Adaptive Cache Bypassing for GPUs, 2016, IEEE 34th International Conference on Computer Design (ICCD).
[25] John Kim, et al. Improving GPGPU resource utilization through alternative thread block scheduling, 2014, IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[26] Scott A. Mahlke, et al. Chimera: Collaborative Preemption for Multitasking on a Shared GPU, 2015, ASPLOS.
[27] Yan Solihin, et al. Counter-Based Cache Replacement and Bypassing Algorithms, 2008, IEEE Transactions on Computers.
[28] Henry Wong, et al. Analyzing CUDA workloads using a detailed GPU simulator, 2009, IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).
[29] Jun Zhang, et al. Thread Scheduling Optimization of General Purpose Graphics Processing Unit: A Survey, 2016.
[30] Henry P. Moreton, et al. The GeForce 6800, 2005, IEEE Micro.
[31] Daniel A. Jiménez, et al. Adaptive GPU cache bypassing, 2015, GPGPU@PPoPP.
[32] Dongrui Fan, et al. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs, 2018, MICRO.
[33] Changjun Jiang, et al. FLEP: Enabling Flexible and Efficient Preemption on GPUs, 2017, ASPLOS.
[34] Mahmut T. Kandemir, et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs, 2013, 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT).
[35] Won Woo Ro, et al. Warped-preexecution: A GPU pre-execution approach for improving latency hiding, 2016, IEEE International Symposium on High Performance Computer Architecture (HPCA).
[36] Nam Sung Kim, et al. The case for GPGPU spatial multitasking, 2012, IEEE International Symposium on High Performance Computer Architecture (HPCA).