Heartbeat scheduling: provable efficiency for nested parallelism

A classic problem in parallel computing is to take a high-level parallel program, written, for example, in nested-parallel style with fork-join constructs, and run it efficiently on a real machine. The problem can be considered solved in theory, but not in practice, because the overheads of creating and managing parallel threads can overwhelm their benefits. Developing efficient parallel codes therefore usually requires extensive tuning and optimization to reduce parallelism to the point where the overheads become acceptable. In this paper, we present a scheduling technique that delivers provably efficient results for arbitrary nested-parallel programs, without the tuning needed to control parallelism overheads. The basic idea behind our technique is to create threads only at a regular beat, which we refer to as the "heartbeat", and to make sure that useful work is performed in between. We specify our heartbeat scheduler using an abstract-machine semantics and provide mechanized proofs that the scheduler guarantees low overheads for all nested-parallel programs. We present a prototype C++ implementation and an evaluation showing that Heartbeat competes well with manually optimized Cilk Plus codes, without requiring manual tuning.
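
To make the idea concrete, here is a minimal single-worker sketch in C++ of heartbeat-style promotion. It is an illustration under stated assumptions, not the paper's implementation: the names (HeartbeatScheduler, fork, join) are hypothetical, it polls the clock at fork and join points rather than using an interrupt- or signal-driven heartbeat, and it promotes a latent fork with std::async rather than pushing onto a work-stealing deque. What it does capture is the core invariant: a fork between beats costs only a closure push, and at most one real parallel task is created per heartbeat, with the oldest outstanding fork (likely the largest remaining piece of work) promoted first.

```cpp
// Minimal sketch of heartbeat-style task promotion (hypothetical API; not the
// paper's mechanized scheduler). Forks are recorded as cheap "latent" entries
// and run sequentially; a real parallel task is created only once per beat.
#include <chrono>
#include <deque>
#include <functional>
#include <future>
#include <vector>

using Clock = std::chrono::steady_clock;

class HeartbeatScheduler {
public:
    explicit HeartbeatScheduler(std::chrono::microseconds beat)
        : beat_(beat), last_beat_(Clock::now()) {}

    // Record a fork. Instead of spawning a thread immediately, remember the
    // branch; it runs inline unless a heartbeat promotes it first.
    void fork(std::function<void()> branch) {
        latent_.push_back(std::move(branch));
        maybe_promote();
    }

    // Join point: run whatever was not promoted, then wait for promoted tasks.
    void join() {
        while (!latent_.empty()) {
            auto f = std::move(latent_.back());
            latent_.pop_back();
            f();  // sequential execution between beats: near-zero overhead
            maybe_promote();
        }
        for (auto& t : promoted_) t.get();
        promoted_.clear();
    }

private:
    // Promote the OLDEST latent fork to a real asynchronous task, but only
    // once per heartbeat interval; this bounds thread-creation overhead.
    void maybe_promote() {
        auto now = Clock::now();
        if (now - last_beat_ < beat_ || latent_.empty()) return;
        last_beat_ = now;
        auto f = std::move(latent_.front());
        latent_.pop_front();
        promoted_.push_back(std::async(std::launch::async, std::move(f)));
    }

    std::chrono::microseconds beat_;
    Clock::time_point last_beat_;
    std::deque<std::function<void()>> latent_;   // cheap, not-yet-parallel forks
    std::vector<std::future<void>> promoted_;    // forks turned into real tasks
};

// Usage (hypothetical work functions), in fork-join style:
//   HeartbeatScheduler s(std::chrono::microseconds(100));
//   s.fork([&]{ work_on_left_half(); });
//   s.fork([&]{ work_on_right_half(); });
//   s.join();
```

The design point the sketch illustrates is the amortization argument: no matter how many forks the program performs, the expensive operation (creating a task or thread) happens at most once per beat, so its cost can be charged against the useful sequential work performed during that interval.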
