An efficient compiler framework for cache bypassing on GPUs

Graphics Processing Units (GPUs) have become ubiquitous for general-purpose applications due to their tremendous computing power. Initially, GPUs employed only scratchpad memory as on-chip memory. Although scratchpad memory benefits many applications, it is not ideal for general-purpose applications with irregular memory accesses. Hence, GPU vendors have introduced caches alongside scratchpad memory in recent generations of GPUs. The caches on GPUs are highly configurable: the programmer or the compiler can explicitly direct each global load instruction to access or bypass the cache. This configurability opens up opportunities for optimizing cache performance. In this paper, we propose an efficient compiler framework for cache bypassing on GPUs. Our objective is to utilize the configurable cache efficiently and improve the overall performance of general-purpose GPU applications. To achieve this goal, we first characterize GPU cache utilization and develop performance metrics that estimate cache reuse and memory traffic. Next, we present efficient algorithms that judiciously select global load instructions for cache access or bypass. Finally, we integrate our techniques into an automatic compiler framework that leverages the PTX instruction set architecture. Experimental evaluation demonstrates that, compared to cache-all and bypass-all solutions, our techniques achieve considerable performance improvement.
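
The cache-access versus bypass decision is ultimately expressed through the cache operators of PTX global load instructions: ld.global.ca caches at all levels, while ld.global.cg caches only at L2 and bypasses L1. The sketch below, written as hand-annotated inline PTX in CUDA, illustrates the kind of per-load annotation such a compiler framework would emit; the kernel, helper names, and the reuse assumptions in the comments are hypothetical and only for illustration, not the framework's actual output.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Load through L1 (PTX cache operator .ca: cache at all levels).
    __device__ __forceinline__ float load_cached(const float *p) {
        float v;
        asm volatile("ld.global.ca.f32 %0, [%1];" : "=f"(v) : "l"(p));
        return v;
    }

    // Load bypassing L1 (PTX cache operator .cg: cache at L2 only).
    __device__ __forceinline__ float load_bypass(const float *p) {
        float v;
        asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
        return v;
    }

    // Toy kernel: 'a' is a small table reused across threads, so its load
    // goes through L1; 'b' is streamed once per thread, so its load
    // bypasses L1 to avoid polluting the cache.
    __global__ void scale(float *out, const float *a, const float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = load_cached(&a[i & 255]) * load_bypass(&b[i]);
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *out;
        cudaMalloc(&a, 256 * sizeof(float));
        cudaMalloc(&b, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        scale<<<(n + 255) / 256, 256>>>(out, a, b, n);
        cudaDeviceSynchronize();
        cudaFree(a); cudaFree(b); cudaFree(out);
        printf("kernel finished\n");
        return 0;
    }

For comparison, the same choice can be made uniformly for every global load with nvcc's -Xptxas -dlcm=ca or -Xptxas -dlcm=cg, which correspond to the cache-all and bypass-all baselines mentioned above; the per-load selection is what the proposed framework automates.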
