CAWS: Criticality-aware warp scheduling for GPGPU workloads

The ability to perform fast context switching and massive multi-threading is the forte of modern GPU architectures, which have emerged as an efficient alternative to traditional chip multiprocessors for parallel workloads. One of the main benefits of such architectures is their latency-hiding capability. However, the efficacy of the GPU's latency hiding varies significantly across GPGPU applications. To investigate this variation, this paper first proposes a new algorithm that profiles the execution behavior of GPGPU applications. We characterize the latencies caused by various pipeline hazards, memory accesses, synchronization primitives, and the warp scheduler. Our results show that the current round-robin warp scheduler effectively overlaps latency stalls with the execution of other available warps for only a few GPGPU applications; for the rest, excessive latency stalls remain that the scheduler cannot hide. Guided by this latency characterization, we observe a significant execution-time disparity among warps within the same thread block, which causes suboptimal performance; we call this the warp criticality problem. To tackle the warp criticality problem, we design a family of criticality-aware warp scheduling (CAWS) policies that schedule the critical warp(s) more frequently than other warps. Our results on breadth-first search, B+tree search, the two-point angular correlation function, and K-means clustering show that, with oracle knowledge of warp criticality, our best-performing scheduling policy improves GPGPU application performance by 17% on average. With our designed criticality predictor, the various scheduling policies improve performance on breadth-first search by 10–21%. To our knowledge, this is the first paper to characterize warp criticality and explore different criticality-aware warp scheduling policies for GPGPU workloads.
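To make the policy concrete, the sketch below models the core selection rule of a criticality-aware scheduler in C++. It is a minimal illustration under our own assumptions, not the paper's implementation: the WarpState structure, the select_warp function, and the use of predicted remaining work as the criticality estimate are all hypothetical names and choices.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-warp bookkeeping; field names are illustrative.
struct WarpState {
    int      id;
    bool     ready;        // no outstanding pipeline or memory hazard this cycle
    uint32_t criticality;  // predicted remaining work, e.g. instructions left
};

// One plausible CAWS-style rule: among the warps that can issue this cycle,
// pick the one predicted to be most critical, breaking ties in round-robin
// order starting after the last issued warp.
int select_warp(const std::vector<WarpState>& warps, int last_issued) {
    const int n = static_cast<int>(warps.size());
    int best = -1;
    uint32_t best_crit = 0;
    for (int i = 1; i <= n; ++i) {
        const WarpState& w = warps[(last_issued + i) % n];  // RR scan order
        if (!w.ready) continue;
        if (best == -1 || w.criticality > best_crit) {
            best = w.id;
            best_crit = w.criticality;
        }
    }
    return best;  // -1 means no warp is ready; the pipeline stalls this cycle
}

int main() {
    // Warp 1 is stalled; warps 2 and 3 tie on criticality, so the round-robin
    // scan order after warp 0 makes warp 2 the winner.
    std::vector<WarpState> warps = {
        {0, true, 120}, {1, false, 300}, {2, true, 250}, {3, true, 250}};
    return select_warp(warps, /*last_issued=*/0) == 2 ? 0 : 1;
}
```

In a real scheduler the criticality field would be refreshed each cycle, by a predictor or by oracle information in the limit study; prioritizing lagging warps this way shortens the time a thread block waits at its final barrier for its slowest warp.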
