Scheduling parallel programs by work stealing with private deques

Work stealing has proven to be an effective method for scheduling parallel programs on multicore computers. To achieve high performance, work stealing distributes tasks among concurrent double-ended queues (deques), one assigned to each processor. Each processor operates on its deque locally, except when performing load balancing via steals. Unfortunately, concurrent deques suffer from two limitations: 1) local deque operations require expensive memory fences on modern weak-memory architectures, and 2) they can be very difficult to extend to support various optimizations and the flexible task-distribution strategies needed by many applications, e.g., those that do not fit nicely into the divide-and-conquer, nested data-parallel paradigm. For these reasons, there has been much recent interest in implementations of work stealing with non-concurrent deques, where deques remain entirely private to each processor and load balancing is performed via message passing. Private deques eliminate the need for memory fences in local operations and enable the design and implementation of efficient techniques for reducing task-creation overheads and improving task distribution. These advantages, however, come at the cost of communication. It is not known whether work stealing with private deques enjoys the theoretical guarantees of work stealing with concurrent deques, or whether it can be effective in practice. In this paper, we propose two work-stealing algorithms with private deques and prove that they guarantee theoretical bounds similar to those of work stealing with concurrent deques. For the analysis, we use a probabilistic model and consider a new parameter, the branching depth of the computation. We present an implementation of the algorithm as a C++ library and show that it compares well to Cilk on a range of benchmarks. Since our approach relies on private deques, it enables flexible task-creation and task-distribution strategies. As a specific example, we show how to implement task coalescing and steal-half strategies, which can be important in fine-grain, non-divide-and-conquer algorithms such as graph algorithms, and we apply them to the depth-first-search problem.
