A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs
Xin Chen | Xubin He | Yuxin Wang | He Guo | Weijun Xiao | Yulong Yu
[1] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[2] Mahmut T. Kandemir,et al. Managing GPU Concurrency in Heterogeneous Architectures , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[3] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[4] Sriram Krishnamoorthy,et al. Efficient scheduling of recursive control flow on GPUs , 2013, ICS '13.
[5] Naga K. Govindaraju,et al. Mars: A MapReduce Framework on graphics processors , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).
[6] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[7] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[8] Hyesoon Kim,et al. TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[9] Henry Wong,et al. Analyzing CUDA workloads using a detailed GPU simulator , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.
[10] William J. Dally,et al. A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors , 2012, TOCS.
[11] Ahmad Khonsari,et al. Warp size impact in GPUs: large or small? , 2013, GPGPU@ASPLOS.
[12] Tor M. Aamodt,et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).
[13] Carole-Jean Wu,et al. CAWS: Criticality-aware warp scheduling for GPGPU workloads , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[14] Milind Kulkarni,et al. SemCache: semantics-aware caching for efficient GPU offloading , 2016, ICS '13.
[15] R. Govindarajan,et al. Preemptive thread block scheduling with online structural runtime prediction for concurrent GPGPU kernels , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[16] Maurice Herlihy,et al. Warp-aware trace scheduling for GPUs , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).
[17] Tor M. Aamodt,et al. Thread block compaction for efficient SIMT control flow , 2011, 2011 IEEE 17th International Symposium on High Performance Computer Architecture.
[18] Xin Chen,et al. A Credit-Based Load-Balance-Aware CTA Scheduling Optimization Scheme in GPGPU , 2014, International Journal of Parallel Programming.
[19] William J. Dally,et al. Energy-efficient mechanisms for managing thread context in throughput processors , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).
[20] John Kim,et al. Improving GPGPU resource utilization through alternative thread block scheduling , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[21] Margaret Martonosi,et al. MRPB: Memory request prioritization for massively parallel processors , 2014, 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA).
[22] Nam Sung Kim,et al. The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Computer Architecture.
[23] Jack J. Purdum,et al. C programming guide , 1983 .
[24] Kevin Skadron,et al. Dynamic warp subdivision for integrated branch and memory divergence tolerance , 2010, ISCA.
[25] Mike O'Connor,et al. Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.
[26] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[27] Xin Chen,et al. APR: A Novel Parallel Repacking Algorithm for Efficient GPGPU Parallel Code Transformation , 2014, GPGPU@ASPLOS.
[28] Wen-mei W. Hwu,et al. Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .
[29] Xiaoyuan Li,et al. Guided Region-Based GPU Scheduling: Utilizing Multi-thread Parallelism to Hide Memory Latency , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[30] Yi Yang,et al. Exploiting uniform vector instructions for GPGPU performance, energy efficiency, and opportunistic reliability enhancement , 2013, ICS '13.
[31] Zhongliang Chen,et al. Scalar Waving: Improving the Efficiency of SIMD Execution on GPUs , 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[32] J. Xu. OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .
[33] Mahmut T. Kandemir,et al. Application-aware Memory System for Fair and Efficient Execution of Concurrent GPGPU Applications , 2014, GPGPU@ASPLOS.
[34] Nicolas Brunie,et al. Simultaneous branch and warp interweaving for sustained GPU performance , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).
[35] Long Chen,et al. Dynamic load balancing on single- and multi-GPU systems , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).