Unified on-chip memory allocation for SIMT architecture

The popularity of general-purpose graphics processing units (GPUs) is largely attributed to the tremendous concurrency enabled by their underlying single instruction, multiple thread (SIMT) architecture. It keeps the context of a large number of threads in registers to enable fast "context switches" when the processor is stalled by execution dependences, memory requests, and so on. The SIMT architecture has a large register file evenly partitioned among all concurrent threads. Per-thread register usage therefore determines the number of concurrent threads, which strongly affects whole-program performance. Existing register allocation techniques, extensively studied over the past several decades, are oblivious to the register contention caused by the concurrent execution of many threads. They are prone to making optimization decisions that benefit a single thread but degrade whole-application performance. Is it possible for compilers to make register allocation decisions that maximize whole-application GPU performance? We tackle this question from two aspects in this paper. First, we propose a unified on-chip memory allocation framework that uses scratch-pad memory to (1) alleviate single-thread register pressure and (2) increase whole-application throughput. Second, we propose a characterization model for the SIMT execution model in order to achieve a desired on-chip memory partition given the register pressure of a program. Overall, we found that it is possible to determine automatically, at compile time, an on-chip memory resource allocation that maximizes concurrency while ensuring good single-thread performance. We evaluated our techniques on a representative set of GPU benchmarks with non-trivial register pressure and achieved up to 1.70 times speedup over the baseline, a traditional register allocation scheme that maximizes single-thread performance.
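To make the trade-off between per-thread register usage and concurrency concrete, the following is a minimal sketch, not the allocation framework or characterization model proposed in the paper. It assumes a hypothetical device with a 32768-entry register file, 49152 bytes of scratch-pad memory, and a hardware limit of 1536 threads, ignores warp and block allocation granularity, and assumes the program itself uses no scratch-pad memory. The function names and all parameter values are illustrative assumptions.

```python
def max_concurrent_threads(regs_per_thread,
                           register_file_size=32768,  # assumed register count
                           hw_thread_limit=1536):     # assumed hardware cap
    """Threads that fit when the register file is split evenly among threads."""
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(hw_thread_limit, register_file_size // regs_per_thread)


def best_partition(reg_pressure,
                   register_file_size=32768,  # assumed register count
                   scratchpad_bytes=49152,    # assumed scratch-pad capacity
                   hw_thread_limit=1536,      # assumed hardware cap
                   bytes_per_value=4,
                   min_regs=16):              # assumed floor on registers kept
    """Brute-force search over register/scratch-pad splits (illustration only).

    Each thread needs `reg_pressure` live values. Keeping `regs` of them in
    registers and the rest in scratch-pad, return the split that admits the
    most concurrent threads as (threads, regs, spilled). Ties keep the split
    with more registers, which favors single-thread speed.
    """
    best = None
    for regs in range(reg_pressure, min_regs - 1, -1):
        spilled = reg_pressure - regs
        by_regs = register_file_size // regs
        # With nothing spilled, scratch-pad does not constrain concurrency.
        by_spad = (scratchpad_bytes // (spilled * bytes_per_value)
                   if spilled else hw_thread_limit)
        threads = min(hw_thread_limit, by_regs, by_spad)
        if best is None or threads > best[0]:
            best = (threads, regs, spilled)
    return best


if __name__ == "__main__":
    print(max_concurrent_threads(40))  # registers only
    print(best_partition(40))          # registers plus scratch-pad
```

Under these assumed parameters, a thread that needs 40 live values is limited to 819 concurrent threads when everything stays in registers, while moving 11 of those values to scratch-pad admits 1117. A real compiler would also have to weigh the extra latency of scratch-pad accesses, which this sketch does not model.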
