Enabling and Exploiting Flexible Task Assignment on GPU through SM-Centric Program Transformations

A GPU's computing power lies in its abundant memory bandwidth and massive parallelism. However, its hardware thread schedulers, although able to distribute computation to processors quickly, often fail to exploit program characteristics effectively and so achieve only a fraction of the GPU's full potential. Moreover, current GPUs do not allow programmers or compilers to control thread scheduling, forfeiting important optimization opportunities at the program level. This paper presents a program transformation centered on Streaming Multiprocessors (SMs); this software approach circumvents the limitations of the hardware scheduler and allows flexible program-level control of scheduling. By permitting precise control of job locality on SMs, the transformation overcomes inherent limitations of prior methods and opens up new opportunities for GPU program optimizations. The second part of the paper explores how these opportunities can be leveraged to enhance GPU performance, what complexities arise, and how to address them. We show that some simple optimization techniques can enhance co-runs of multiple kernels and improve the data locality of irregular applications, yielding 20-33% average improvements in performance, system throughput, and average turnaround time.
