Efficient kernel management on GPUs
[1] Zhen Lin,et al. Enabling Efficient Preemption for SIMT Architectures with Lightweight Context Switching , 2016, SC16: International Conference for High Performance Computing, Networking, Storage and Analysis.
[2] Yu Wang,et al. Coordinated static and dynamic cache bypassing for GPUs , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[3] Mateo Valero,et al. Enabling preemptive multiprogramming on GPUs , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).
[4] Nam Sung Kim,et al. GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.
[5] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[6] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[7] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[8] Norman P. Jouppi,et al. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[9] Ari B. Hayes,et al. Unified on-chip memory allocation for SIMT architecture , 2014, ICS '14.
[10] Yun Liang,et al. An efficient compiler framework for cache bypassing on GPUs , 2013, 2013 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).
[11] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[12] R. Govindarajan,et al. Improving GPGPU concurrency with elastic kernels , 2013, ASPLOS '13.
[13] Maurice Herlihy,et al. Warp-aware trace scheduling for GPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[14] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[15] Yun Liang,et al. Efficient GPU Spatial-Temporal Multitasking , 2015, IEEE Transactions on Parallel and Distributed Systems.
[16] John Kim,et al. Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[17] Yun Liang,et al. An Accurate GPU Performance Model for Effective Control Flow Divergence Optimization , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.
[18] Mike O'Connor,et al. Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[19] Margaret Martonosi,et al. Characterizing and improving the use of demand-fetched caches in GPUs , 2012, ICS '12.
[20] Keshav Pingali,et al. A quantitative study of irregular programs on GPUs , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).
[21] Kevin Skadron,et al. Fine-grained resource sharing for concurrent GPGPU kernels , 2012, HotPar'12.
[22] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[23] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[24] Xiuhong Li,et al. Efficient kernel management on GPUs , 2016, DATE 2016.
[25] William J. Dally,et al. Unifying Primary Cache, Scratch, and Register File Memories in a Throughput Processor , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[26] Xuhao Chen,et al. Architecting energy-efficient STT-RAM based register file on GPGPUs via delta compression , 2016, 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC).
[27] Xuhao Chen,et al. Adaptive Cache Management for Energy-Efficient GPU Computing , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[28] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[29] Carole-Jean Wu,et al. CAWS: Criticality-aware warp scheduling for GPGPU workloads , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[30] Daniel R. Johnson,et al. A variable warp size architecture , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).
[31] Zhen Lin,et al. Automatic data placement into GPU on-chip memory resources , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[32] Richard W. Vuduc,et al. Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.
[33] Yue Zhao,et al. EffiSha: A Software Framework for Enabling Efficient Preemptive Scheduling of GPU , 2017, PPoPP.
[34] Nam Sung Kim,et al. The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[35] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[36] Dong Li,et al. Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations , 2015, ICS.
[37] Shuaiwen Song,et al. Locality-Driven Dynamic GPU Cache Bypassing , 2015, ICS.
[38] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[39] Dongrui Fan,et al. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs , 2018, MICRO.