SAWS: Synchronization aware GPGPU warp scheduling for multiple independent warp schedulers

General-purpose computing on Graphics Processing Units (GPGPU) has become increasingly popular for a wide range of applications beyond traditional graphics rendering workloads. GPGPUs exploit parallelism in applications via multithreading to hide memory latency and handle control complexity through barrier synchronization. Warp scheduling algorithms have been optimized to increase memory latency hiding, improve cache behavior, or alleviate branch divergence. To date, however, no scheduler accounts for the synchronization behavior among warps in the presence of multiple warp schedulers. In this paper, we develop a synchronization-aware warp scheduling algorithm. The key observation is that excessive stall cycles may be introduced when synchronizing warps reside in different warp schedulers. We propose that schedulers coordinate with each other to prevent warps from being blocked at a barrier for too long. Such coordination dynamically reorders the execution sequence of warps so that they are issued in close proximity when a barrier is encountered. Performance evaluations demonstrate that our proposed coordinated schedulers improve the performance of synchronization-rich benchmarks by 10% on average compared to state-of-the-art non-coordinating schedulers.
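
The coordination idea can be illustrated with a minimal C++ sketch of a synchronization-aware issue-priority rule. The names here (Warp, BarrierTable, pick_next_warp) and the shared per-CTA barrier-arrival table are assumptions made for illustration, not the paper's actual implementation: each scheduler consults a table, shared by all schedulers on the SM, recording how many warps of each CTA are already waiting at a barrier, and preferentially issues the stragglers of those CTAs so that blocked warps are released sooner.

    // Hypothetical sketch of synchronization-aware warp selection.
    // Structure and names are illustrative assumptions, not the paper's design.
    #include <iostream>
    #include <unordered_map>
    #include <vector>

    struct Warp {
        int id;
        int cta_id;        // cooperative thread array this warp belongs to
        bool ready;        // has an issuable instruction this cycle
        bool at_barrier;   // currently stalled at a barrier (__syncthreads)
    };

    // Assumed shared state across all warp schedulers on an SM: how many
    // warps of each CTA have already arrived at the current barrier.
    using BarrierTable = std::unordered_map<int, int>;

    // Greedy selection: among ready warps, prefer one whose CTA already has
    // warps waiting at a barrier, so the stragglers reach the barrier sooner
    // and the blocked warps are released earlier.
    const Warp* pick_next_warp(const std::vector<Warp>& warps,
                               const BarrierTable& waiting) {
        const Warp* best = nullptr;
        int best_waiting = -1;
        for (const Warp& w : warps) {
            if (!w.ready || w.at_barrier) continue;
            auto it = waiting.find(w.cta_id);
            int n_waiting = (it == waiting.end()) ? 0 : it->second;
            if (n_waiting > best_waiting) {
                best_waiting = n_waiting;
                best = &w;
            }
        }
        return best;  // nullptr if no warp can issue this cycle
    }

    int main() {
        std::vector<Warp> warps = {
            {0, /*cta*/0, /*ready*/true,  /*at_barrier*/false},
            {1, /*cta*/1, /*ready*/true,  /*at_barrier*/false},
            {2, /*cta*/1, /*ready*/false, /*at_barrier*/true},
            {3, /*cta*/1, /*ready*/false, /*at_barrier*/true},
        };
        BarrierTable waiting = {{1, 2}};  // CTA 1 already has two warps blocked

        const Warp* w = pick_next_warp(warps, waiting);
        std::cout << "issue warp " << w->id << " (CTA " << w->cta_id << ")\n";
        // Prints: issue warp 1 (CTA 1) -- the straggler of the blocked CTA
        return 0;
    }

In this sketch the only cross-scheduler coordination is the shared barrier-arrival table; an otherwise unmodified baseline policy could break ties with it, which reflects the dynamic reordering described above without dictating any particular hardware mechanism.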
