Power Efficient Sharing-Aware GPU Data Management

The power consumed by the memory system in GPUs is a significant fraction of total chip power. As thread-level parallelism increases, GPUs are likely to stress the cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring cooperative thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy spreads these CTAs across different streaming multiprocessors (SMs) in a round-robin fashion. Since each SM has a private L1 cache, data shared among CTAs is replicated across the L1 caches of different SMs. This replication reduces the effective L1 cache size, which in turn increases data movement and power consumption. The goal of this paper is to reduce data movement and increase the effective cache space in GPUs. We propose a sharing-aware CTA scheduler that attempts to assign CTAs that share data to the same SM, reducing redundant copies of the same blocks in the private L1 caches of different SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy that dynamically classifies cache blocks as private or shared: private blocks are given higher priority to stay longer in the L1 cache, and shared blocks are given higher priority to stay longer in the L2 cache. In essence, this approach extends the lifetime of shared and private blocks at different cache levels. Experimental results show that the proposed scheme reduces off-chip traffic by 19%, which translates to an average DRAM power reduction of 10% and a performance improvement of 7%.
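
To make the scheduling idea concrete, the following is a minimal C++ sketch of sharing-aware CTA-to-SM assignment, contrasted with the default round-robin policy. This is an illustration under stated assumptions, not the paper's implementation: the machine parameters and the heuristic of grouping consecutive CTA IDs (since neighboring CTAs are the ones observed to share data) are hypothetical.

```cpp
#include <cstdio>

// Hypothetical parameters, chosen for illustration only.
constexpr int NUM_SMS       = 16; // streaming multiprocessors on the chip
constexpr int SHARING_GROUP = 4;  // neighboring CTAs assumed to share data

// Baseline round-robin: consecutive CTAs land on different SMs, so
// blocks shared by neighbors get replicated across private L1 caches.
int round_robin(int cta_id)   { return cta_id % NUM_SMS; }

// Sharing-aware: keep each group of SHARING_GROUP neighboring CTAs on
// the same SM, so their shared blocks hit in a single L1 cache.
int sharing_aware(int cta_id) { return (cta_id / SHARING_GROUP) % NUM_SMS; }

int main() {
    for (int cta = 0; cta < 12; ++cta)
        std::printf("CTA %2d -> SM %2d (round-robin), SM %2d (sharing-aware)\n",
                    cta, round_robin(cta), sharing_aware(cta));
    return 0;
}
```

Similarly, here is a minimal sketch of the sharing-aware replacement priority, assuming each cache line carries a shared bit that is set once a second CTA touches the line. In L1 the policy prefers to evict shared lines (which can still be served, and are prioritized, in L2), so private lines stay longer in L1; in L2 the preference is inverted. All structure and names here are assumptions for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// One cache line's replacement state. The "shared" bit sketches the
// dynamic private/shared classification described in the abstract.
struct Line {
    bool     valid;
    bool     shared;  // set once a second CTA touches the line
    uint32_t lru_age; // larger = older
};

// Pick an eviction victim. In L1, shared lines are deprioritized so
// private lines stay longer; in L2, private lines are deprioritized so
// shared lines stay longer. Within a class, plain LRU decides.
int pick_victim(const std::vector<Line>& set, bool is_l1) {
    int victim = 0;
    uint32_t best = 0;
    for (int i = 0; i < (int)set.size(); ++i) {
        if (!set[i].valid) return i; // free way: no eviction needed
        bool deprioritized = is_l1 ? set[i].shared : !set[i].shared;
        // The high bit ranks the deprioritized class ahead of the
        // other; the LRU age breaks ties within a class.
        uint32_t score = (deprioritized ? (1u << 31) : 0u) | set[i].lru_age;
        if (score > best) { best = score; victim = i; }
    }
    return victim;
}

int main() {
    // Four-way set: ways 0 and 2 are shared; way 3 is the oldest private line.
    std::vector<Line> set = {{true, true, 5}, {true, false, 7},
                             {true, true, 2}, {true, false, 9}};
    std::printf("L1 victim: way %d\n", pick_victim(set, true));  // a shared way
    std::printf("L2 victim: way %d\n", pick_victim(set, false)); // a private way
}
```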
