Exposing Memory Access Patterns to Improve Instruction and Memory Efficiency in GPUs

Modern computing workloads often have high memory intensity, requiring high bandwidth access to memory. The memory request patterns of these workloads vary and include regular strided accesses and indirect (pointer-based) accesses. Such applications require a large number of address generation instructions and a high degree of memory-level parallelism. This article proposes new memory instructions that exploit strided and indirect memory request patterns and improve efficiency in GPU architectures. The new instructions reduce address calculation instructions by offloading addressing to dedicated hardware, and reduce destructive memory request interference by grouping related requests together. Our results show that we can eliminate 33% of dynamic instructions across 16 GPU benchmarks. These improvements result in an overall runtime improvement of 26%, an energy reduction of 18%, and a reduction in energy-delay product of 32%.

[1]  Christoforos E. Kozyrakis,et al.  Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks , 2002, MICRO.

[2]  Tor M. Aamodt,et al.  Complexity effective memory access scheduling for many-core accelerator architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[3]  Vikram S. Adve,et al.  LLVM: a compilation framework for lifelong program analysis & transformation , 2004, International Symposium on Code Generation and Optimization, 2004. CGO 2004..

[4]  David W. Nellans,et al.  Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[5]  Scott A. Mahlke,et al.  APOGEE: Adaptive prefetching on GPUs for energy efficiency , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[6]  Jeffrey S. Vetter,et al.  Performance evaluation of the Cray X1 distributed shared memory architecture , 2004, Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects.

[7]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[8]  Mike O'Connor,et al.  Cache-Conscious Wavefront Scheduling , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[9]  Alexander V. Veidenbaum,et al.  Compiler-directed data prefetching in multiprocessors with memory hierarchies , 1990, ICS '90.

[10]  Hyunseok Lee,et al.  An Alternative Memory Access Scheduling in Manycore Accelerators , 2011, 2011 International Conference on Parallel Architectures and Compilation Techniques.

[11]  Mike O'Connor,et al.  Divergence-Aware Warp Scheduling , 2013, 2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[12]  Carole-Jean Wu,et al.  CAWA: Coordinated warp scheduling and Cache Prioritization for critical warp acceleration of GPGPU workloads , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[13]  Srinivas Devadas,et al.  IMP: Indirect memory prefetcher , 2015, 2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[14]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[15]  Wen-mei W. Hwu,et al.  Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing , 2012 .

[16]  Franz Franchetti,et al.  Efficient Utilization of SIMD Extensions , 2005, Proceedings of the IEEE.

[17]  Christopher Torng,et al.  Microarchitectural mechanisms to exploit value structure in SIMT architectures , 2013, ISCA.

[18]  Richard M. Russell,et al.  The CRAY-1 computer system , 1978, CACM.

[19]  Richard W. Vuduc,et al.  Many-Thread Aware Prefetching Mechanisms for GPGPU Applications , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[20]  Saman P. Amarasinghe,et al.  Exploiting superword level parallelism with multimedia instruction sets , 2000, PLDI '00.

[21]  Nam Sung Kim,et al.  G-Scalar: Cost-Effective Generalized Scalar Execution Architecture for Power-Efficient GPUs , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[22]  Jaeha Kim,et al.  Memory-centric system interconnect design with Hybrid Memory Cubes , 2013, Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques.

[23]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[24]  Jean-Loup Baer,et al.  Effective Hardware Based Data Prefetching for High-Performance Processors , 1995, IEEE Trans. Computers.

[25]  Nam Sung Kim,et al.  Power-efficient computing for compute-intensive GPGPU applications , 2012, 2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA).

[26]  Matthew Mattina,et al.  Tarantula: a vector extension to the alpha architecture , 2002, Proceedings 29th Annual International Symposium on Computer Architecture.

[27]  Nam Sung Kim,et al.  GPUWattch: enabling energy optimizations in GPGPUs , 2013, ISCA.

[28]  Sandia Report,et al.  Improving Performance via Mini-applications , 2009 .

[29]  Carole-Jean Wu,et al.  CAWS: Criticality-aware warp scheduling for GPGPU workloads , 2014, 2014 23rd International Conference on Parallel Architecture and Compilation (PACT).

[30]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[31]  Youfeng Wu,et al.  Efficient discovery of regular stride patterns in irregular programs and its use in compiler prefetching , 2002, PLDI '02.

[32]  Mahmut T. Kandemir,et al.  OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance , 2013, ASPLOS '13.

[33]  Jason Cong,et al.  Polyhedral-based data reuse optimization for configurable computing , 2013, FPGA '13.

[34]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[35]  Ken Kennedy,et al.  Optimizing Compilers for Modern Architectures: A Dependence-based Approach , 2001 .

[36]  Janak H. Patel,et al.  Stride directed prefetching in scalar processors , 1992, MICRO.

[37]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[38]  Yi Yang,et al.  A GPGPU compiler for memory optimization and parallelism management , 2010, PLDI '10.

[39]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).