Static code transformations for thread‐dense memory accesses in GPU computing

Due to the GPU's complex memory system and massive thread-level parallelism, application programmers often have difficulty optimizing GPU programs. An essential memory optimization is to use low-latency on-chip memory to avoid the high latency of off-chip memory accesses. Shared memory is an on-chip memory that is explicitly managed by the programmer; its read/write latency is similar to that of the L1 cache, but poor data management can still degrade performance. In this paper, we present a static code transformation that preloads datasets into the GPU's shared memory. Our static analysis targets global memory requests with high thread density as candidates for preloading. A thread-dense memory access pattern is one in which many threads in a thread block reuse the same data, so that the preloaded data yields high reuse while occupying the limited shared-memory address space efficiently. When selecting datasets for preloading, we bound shared-memory usage so that thread-level parallelism remains at the same level. Finally, our source-to-source compiler preloads the selected datasets into shared memory by transforming unoptimized GPU kernel code. Our method achieves average (geometric-mean) speedups of 1.26× on a GTX 980 GPU and 1.62× on a P100 GPU.
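
As a concrete illustration of the kind of transformation described, the following CUDA sketch shows a kernel with a thread-dense access (every thread in a block re-reads the same lookup table from global memory) and a hand-written version of what a shared-memory preloading transformation could produce. This is a minimal sketch, not the paper's actual compiler output: the kernel, the array names, and the TILE size are hypothetical, and in the paper's system the static analysis would choose the preloaded dataset and its footprint so that thread-level parallelism is not reduced.

// Before: a thread-dense access. The index into `table` does not depend
// on threadIdx, so every thread in the block re-reads the same global
// memory locations on every loop iteration.
__global__ void scale_rows(const float* table, const float* in,
                           float* out, int n, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int j = 0; j < k; ++j)
        acc += table[j] * in[i * k + j];   // table[j] re-read by every thread
    out[i] = acc;
}

// After: the transformed kernel preloads `table` into shared memory tile
// by tile, so all reuse within the block is served from on-chip memory.
#define TILE 256   // hypothetical tile size; bounded so occupancy is unchanged
__global__ void scale_rows_shared(const float* table, const float* in,
                                  float* out, int n, int k) {
    __shared__ float s_table[TILE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int base = 0; base < k; base += TILE) {
        // Cooperative load: each thread fetches distinct elements of the tile.
        for (int j = threadIdx.x; j < TILE && base + j < k; j += blockDim.x)
            s_table[j] = table[base + j];
        __syncthreads();                    // tile now visible to the whole block
        if (i < n)
            for (int j = 0; j < TILE && base + j < k; ++j)
                acc += s_table[j] * in[i * k + (base + j)];
        __syncthreads();                    // safe to overwrite the tile
    }
    if (i < n) out[i] = acc;
}

Note that the transformed kernel must not return early before the barriers: every thread in the block, including those with i >= n, participates in the cooperative loads and both __syncthreads() calls, and only the per-thread computation is guarded.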
