GPU Schedulers: How Fair Is Fair Enough?

Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to support some, but not all, blocking idioms. While a number of applications that use blocking idioms do run on today's GPUs, reasoning about the liveness properties of such applications is difficult because documentation is scarce and scattered. In this work, we aim to clarify the fairness properties of semi-fair schedulers. To do this, we define a general temporal logic formula, based on weak fairness, parameterised by a predicate that enables fairness per thread at certain points of an execution. We then define fairness properties for three GPU schedulers: HSA, OpenCL, and occupancy-bound execution. We examine existing GPU applications and show that none of these schedulers is strong enough to provide the fairness properties that the applications require. It hence appears that existing GPU scheduler descriptions do not entirely capture the fairness properties that are provided on current GPUs. Thus, we present two new schedulers that aim to support existing GPU applications. We analyse the behaviour of common blocking idioms under each scheduler and show that one of our new schedulers allows a more natural implementation of a GPU protocol.

2012 ACM Subject Classification: Software and its engineering → Semantics; Software and its engineering → Scheduling; Computing methodologies → Graphics processors
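As a reading aid for the parameterised weak-fairness formula mentioned in the abstract, the general shape can be sketched as follows. This is an illustrative reconstruction, not necessarily the exact formula used in the work: the names enabled, exec and fair, and the thread set T, are assumptions introduced here. Standard weak fairness says that a thread that is eventually always enabled is executed infinitely often; the parameterised variant grants this guarantee only while a scheduler-specific predicate holds for that thread.

```latex
% Standard weak fairness for a thread t (textbook form):
% if t is eventually always enabled, then t executes infinitely often.
\[
  \forall t \in T:\quad
  \Diamond\Box\,\mathit{enabled}(t) \;\rightarrow\; \Box\Diamond\,\mathit{exec}(t)
\]

% Sketch of a parameterised variant (hypothetical predicate fair(t)):
% the guarantee applies only while the scheduler-specific predicate
% fair(t) also holds for the thread.
\[
  \forall t \in T:\quad
  \Diamond\Box\,\bigl(\mathit{enabled}(t) \wedge \mathit{fair}(t)\bigr)
  \;\rightarrow\; \Box\Diamond\,\mathit{exec}(t)
\]
```

Under this reading, taking fair(t) to be always true recovers ordinary weak fairness, while taking it to be always false yields a scheduler with no fairness guarantee; semi-fair schedulers would fall between the two, depending on when the predicate holds.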
