Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations
暂无分享,去创建一个
Dong Li | Jeffrey S. Vetter | Bo Wu | Guoyang Chen | Xipeng Shen | J. Vetter | Dong Li | Xipeng Shen | Bo Wu | Guoyang Chen
[1] Wu-chun Feng,et al. Inter-block GPU communication via fast barrier synchronization , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[2] Feng Qin,et al. GRace: a low-overhead mechanism for detecting data races in GPU programs , 2011, PPoPP '11.
[3] Mahmut T. Kandemir,et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.
[4] Jeff A. Stuart,et al. A study of Persistent Threads style GPU programming for GPGPU workloads , 2012, 2012 Innovative Parallel Computing (InPar).
[5] Brad Calder,et al. Phase tracking and prediction , 2003, ISCA '03.
[6] Bo Wu,et al. Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU , 2013, PPoPP '13.
[7] Larry Carter,et al. Compile-time composition of run-time data and iteration reorderings , 2003, PLDI '03.
[8] Dean M. Tullsen,et al. Initial observations of the simultaneous multithreading Pentium 4 processor , 2003, 2003 12th International Conference on Parallel Architectures and Compilation Techniques.
[9] Chau-Wen Tseng,et al. Improving Locality for Adaptive Irregular Scientific Codes , 2000, LCPC.
[10] Mahmut T. Kandemir,et al. Neither more nor less: Optimizing thread-level parallelism for GPGPUs , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.
[11] Mahmut T. Kandemir,et al. Orchestrated scheduling and prefetching for GPGPUs , 2013, ISCA.
[12] Keshav Pingali,et al. Morph algorithms on GPUs , 2013, PPoPP '13.
[13] Xipeng Shen,et al. A cross-input adaptive framework for GPU program optimizations , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[14] Diana Marculescu,et al. Characterizing chip-multiprocessor variability-tolerance , 2008, 2008 45th ACM/IEEE Design Automation Conference.
[15] Chau-Wen Tseng,et al. Exploiting locality for irregular scientific codes , 2006, IEEE Transactions on Parallel and Distributed Systems.
[16] Keshav Pingali,et al. A quantitative study of irregular programs on GPUs , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).
[17] Xipeng Shen,et al. Streamlining GPU applications on the fly: thread divergence elimination through runtime thread-data remapping , 2010, ICS '10.
[18] Chen Ding,et al. Lightweight reference affinity analysis , 2005, ICS '05.
[19] Nam Sung Kim,et al. Improving Throughput of Power-Constrained GPUs Using Dynamic Voltage/Frequency and Core Scaling , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[20] Larry Carter,et al. Localizing non-affine array references , 1999, 1999 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00425).
[21] Anjul Patney,et al. Task management for irregular-parallel workloads on the GPU , 2010, HPG '10.
[22] Laxmi N. Bhuyan,et al. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures , 2013, TACO.
[23] Chen Ding,et al. Locality phase prediction , 2004, ASPLOS XI.
[24] Chen Ding,et al. Array regrouping and structure splitting using whole-program reference affinity , 2004, PLDI '04.
[25] Dong Li,et al. PORPLE: An Extensible Optimizer for Portable Data Placement on GPU , 2014, 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
[26] Jianlong Zhong,et al. Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling , 2013, IEEE Transactions on Parallel and Distributed Systems.
[27] Hyesoon Kim,et al. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[28] Ken Kennedy,et al. Improving cache performance in dynamic applications through data and computation reorganization at run time , 1999, PLDI '99.
[29] Zhen Lin,et al. Automatic data placement into GPU on-chip memory resources , 2015, 2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO).
[30] Bo Wu,et al. Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.
[31] R. Govindarajan,et al. Improving GPGPU concurrency with elastic kernels , 2013, ASPLOS '13.
[32] Keshav Pingali,et al. Data-Driven Versus Topology-driven Irregular Computations on GPUs , 2013, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing.
[33] Kevin Skadron,et al. Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).
[34] Long Chen,et al. Dynamic load balancing on single- and multi-GPU systems , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[35] Xipeng Shen,et al. On-the-fly elimination of dynamic irregularities for GPU computing , 2011, ASPLOS XVI.
[36] Margaret Martonosi,et al. Characterizing and improving the use of demand-fetched caches in GPUs , 2012, ICS '12.
[37] Onur Mutlu,et al. Improving GPU performance via large warps and two-level warp scheduling , 2011, 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).
[38] Adam Betts,et al. GPUVerify: a verifier for GPU kernels , 2012, OOPSLA '12.
[39] Nam Sung Kim,et al. The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.
[40] Rudolf Eigenmann,et al. Cetus - An Extensible Compiler Infrastructure for Source-to-Source Transformation , 2003, LCPC.
[41] Ninghui Sun,et al. SMAT: an input adaptive auto-tuner for sparse matrix-vector multiplication , 2013, PLDI.
[42] Collin McCurdy,et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite , 2010, GPGPU-3.
[43] Stijn Eyerman,et al. System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.
[44] Dieter Schmalstieg,et al. Whippletree , 2014, ACM Trans. Graph..
[45] Timo Aila,et al. Understanding the efficiency of ray traversal on GPUs , 2009, High Performance Graphics.