Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming
暂无分享,去创建一个
Won Woo Ro | Hyeran Jeon | Murali Annavaram | Keunsoo Kim | Qiumin Xu | M. Annavaram | Hyeran Jeon | Qiumin Xu | Keunsoo Kim | W. Ro
[1] Donald Yeung,et al. Learning-Based SMT Processor Resource Distribution via Hill-Climbing , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).
[2] Sandhya Dwarkadas,et al. Compatible phase co-scheduling on a CMP of multi-threaded processors , 2006, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[3] Mateo Valero,et al. Enabling preemptive multiprogramming on GPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[4] Nam Sung Kim,et al. The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.
[5] A. Janiszewski,et al. Architectural support for enhanced SMT job scheduling , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..
[6] Jianlong Zhong,et al. Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling , 2013, IEEE Transactions on Parallel and Distributed Systems.
[7] Jung Ho Ahn,et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[8] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[9] Dean M. Tullsen,et al. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor , 2002, SIGMETRICS '02.
[10] Francisco J. Cazorla,et al. Optimal task assignment in multithreaded processors: a statistical approach , 2012, ASPLOS XVII.
[11] Jin Wang,et al. Dynamic Thread Block Launch: A lightweight execution mechanism to support irregular applications on GPUs , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[12] John Kim,et al. Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[13] Nam Sung Kim,et al. Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[14] Yun Liang,et al. Efficient GPU Spatial-Temporal Multitasking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[15] Onur Mutlu,et al. Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.
[16] Scott A. Mahlke,et al. Chimera: Collaborative Preemption for Multitasking on a Shared GPU , 2015, ASPLOS.
[17] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[18] Kevin Skadron,et al. Fine-grained resource sharing for concurrent GPGPU kernels , 2012, HotPar'12.
[19] Aaftab Munshi,et al. The OpenCL specification , 2009, 2009 IEEE Hot Chips 21 Symposium (HCS).
[20] Francisco J. Cazorla,et al. Dynamically Controlled Resource Allocation in SMT Processors , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).
[21] A. Snavely,et al. Symbiotic jobscheduling for a simultaneous mutlithreading processor , 2000, SIGP.
[22] Indrani Paul,et al. Achieving Exascale Capabilities through Heterogeneous Computing , 2015, IEEE Micro.
[23] Francisco J. Cazorla,et al. Predictable performance in SMT processors: synergy between the OS and SMTs , 2006, IEEE Transactions on Computers.
[24] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[25] Tulika Mitra,et al. Improving GPGPU energy-efficiency through concurrent kernel execution and DVFS , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[26] Lizhong Chen,et al. Futility Scaling: High-Associativity Cache Partitioning , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[27] Yale N. Patt,et al. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).
[28] Daniel Pérez Palomar,et al. Practical algorithms for a family of waterfilling solutions , 2005, IEEE Transactions on Signal Processing.
[29] Hyeran Jeon,et al. Graph processing on GPUs: Where are the bottlenecks? , 2014, 2014 IEEE International Symposium on Workload Characterization (IISWC).
[30] Yan Solihin,et al. Predicting inter-thread cache contention on a chip multi-processor architecture , 2005, 11th International Symposium on High-Performance Computer Architecture.
[31] Ravi Iyer,et al. Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[32] Nam Sung Kim,et al. GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.
[33] S. Kim,et al. Fair cache sharing and partitioning in a chip multiprocessor architecture , 2004, Proceedings. 13th International Conference on Parallel Architecture and Compilation Techniques, 2004. PACT 2004..
[34] Hyeran Jeon,et al. Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[35] R. Govindarajan,et al. Improving GPGPU concurrency with elastic kernels , 2013, ASPLOS '13.
[36] Yi Yang,et al. Warp-level divergence in GPUs: Characterization, impact, and mitigation , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[37] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[38] Rami G. Melhem,et al. Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).
[39] Scott A. Mahlke,et al. Equalizer: Dynamic Tuning of GPU Resources for Efficient Execution , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[40] Won Woo Ro,et al. Virtual Thread: Maximizing Thread-Level Parallelism beyond GPU Scheduling Limit , 2016, 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA).
[41] Avi Mendelson,et al. Many-Core vs. Many-Thread Machines: Stay Away From the Valley , 2009, IEEE Computer Architecture Letters.
[42] Nam Sung Kim,et al. Fair share: Allocation of GPU resources for both performance and fairness , 2014, 2014 IEEE 32nd International Conference on Computer Design (ICCD).
[43] Murali Annavaram,et al. PATS: Pattern aware scheduling and power gating for GPGPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[44] Dave Johnson,et al. 4.8 A 28nm x86 APU optimized for power and area efficiency , 2015, 2015 IEEE International Solid-State Circuits Conference - (ISSCC) Digest of Technical Papers.
[45] Tom R. Halfhill. NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .
[46] Stijn Eyerman,et al. Probabilistic job symbiosis modeling for SMT processor scheduling , 2010, ASPLOS XV.
[47] David R. Kaeli,et al. Runtime Support for Adaptive Spatial Partitioning and Inter-Kernel Communication on GPUs , 2014, 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing.
[48] Mahmut T. Kandemir,et al. Anatomy of GPU Memory System for Multi-Application Execution , 2015, MEMSYS.