Memory-aware TLP throttling and cache bypassing for GPUs

General-purpose graphics processing units (GPGPUs) have become one of the most important high-performance platforms for high-throughput applications. However, on-chip resource contention often occurs because of the large number of concurrently running threads inside a GPGPU, and it has become an important factor limiting GPGPU performance. We propose a memory-aware TLP throttling and cache bypassing (MATB) mechanism that exploits data temporal locality and memory bandwidth. It aims to keep cache blocks with good data locality inside the L1D cache longer while maintaining on-chip resource utilization. On one hand, it alleviates cache contention by throttling the scheduling of memory warps with poor data reuse when cache contention and on-chip network congestion occur. On the other hand, it utilizes memory bandwidth more effectively through cache bypassing. Experimental results show that MATB achieves average performance improvements of 26.6% and 14.2% over GTO and DYNCTA, respectively, at low hardware cost.
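To make the two decisions concrete, the sketch below illustrates, in simulator-style C++, how a scheduler might combine the contention signals and per-warp reuse estimates the abstract describes: throttle a memory warp when the L1D and the on-chip network are both congested and its blocks show poor reuse, and route low-reuse requests around the L1D. This is a minimal conceptual sketch, not the paper's implementation; every name, field, and threshold (`WarpState`, `ContentionSignals`, `kLowReuse`, etc.) is a hypothetical stand-in.

```cpp
#include <iostream>

// Hypothetical per-warp state tracked by the scheduler; field names are
// illustrative, not taken from the paper.
struct WarpState {
    int id;
    bool is_memory_warp;   // warp is currently issuing memory instructions
    double reuse_score;    // estimated temporal reuse of its cache blocks, in [0, 1]
};

// Hypothetical contention signals a hardware monitor might expose.
struct ContentionSignals {
    double l1d_miss_rate;  // observed L1D miss rate
    double noc_occupancy;  // on-chip network buffer occupancy
};

// Illustrative thresholds; a real design would tune these empirically.
constexpr double kMissRateHigh = 0.6;
constexpr double kNocBusy      = 0.7;
constexpr double kLowReuse     = 0.3;

// TLP throttling: when both the cache and the network are congested,
// defer scheduling of memory warps whose blocks show poor reuse.
bool should_throttle(const WarpState& w, const ContentionSignals& s) {
    bool congested = s.l1d_miss_rate > kMissRateHigh &&
                     s.noc_occupancy > kNocBusy;
    return congested && w.is_memory_warp && w.reuse_score < kLowReuse;
}

// Cache bypassing: requests from low-reuse warps skip the L1D entirely,
// so blocks with good locality stay resident longer.
bool should_bypass_l1d(const WarpState& w) {
    return w.reuse_score < kLowReuse;
}

int main() {
    ContentionSignals signals{0.72, 0.81};      // both signals above threshold
    WarpState streaming_warp{0, true, 0.1};     // poor reuse: throttle + bypass
    WarpState reuse_warp{1, true, 0.9};         // good reuse: keep in L1D

    std::cout << "warp 0 throttled: " << should_throttle(streaming_warp, signals)
              << ", bypasses L1D: " << should_bypass_l1d(streaming_warp) << "\n";
    std::cout << "warp 1 throttled: " << should_throttle(reuse_warp, signals)
              << ", bypasses L1D: " << should_bypass_l1d(reuse_warp) << "\n";
    return 0;
}
```

Coupling the two decisions this way matches the abstract's intent: the same low-reuse warps that are held back under congestion are also the ones whose traffic bypasses the L1D, so the cache capacity freed up goes to warps with good locality.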
