Using memory mapping to support cactus stacks in work-stealing runtime systems

Many multithreaded concurrency platforms that use a work-stealing runtime system incorporate a “cactus stack,” wherein a function's accesses to stack variables properly respect the function's calling ancestry, even when many of the functions operate in parallel. Unfortunately, such existing concurrency platforms fail to satisfy at least one of the following three desirable criteria: † full interoperability with legacy or third-party serial binaries that have been compiled to use an ordinary linear stack, † a scheduler that provides near-perfect linear speedup on applications with sufficient parallelism, and † bounded and efficient use of memory for the cactus stack. We have addressed this cactus-stack problem by modifying the Linux operating system kernel to provide support for thread-local memory mapping (TLMM). We have used TLMM to reimplement the cactus stack in the open-source Cilk-5 runtime system. The Cilk-M runtime system removes the linguistic distinction imposed by Cilk-5 between serial code and parallel code, erases Cilk-5's limitation that serial code cannot call parallel code, and provides full compatibility with existing serial calling conventions. The Cilk-M runtime system provides strong guarantees on scheduler performance and stack space. Benchmark results indicate that the performance of the prototype Cilk-M 1.0 is comparable to the Cilk 5.4.6 system, and the consumption of stack space is modest.

[1]  Gregory R. Andrews,et al.  Distributed filaments: efficient fine-grain parallelism on a cluster of workstations , 1994, OSDI '94.

[2]  Michael L. Scott,et al.  Scalable reader-writer synchronization for shared-memory multiprocessors , 1991, PPOPP '91.

[3]  Robert H. Halstead,et al.  MULTILISP: a language for concurrent symbolic computation , 1985, TOPL.

[4]  References , 1971 .

[5]  Robert D. Blumofe,et al.  Hood: A user-level threads library for multiprogrammed multiprocessors , 1998 .

[6]  F. Warren Burton,et al.  Executing functional programs on a virtual tree of processors , 1981, FPCA '81.

[7]  C. H. Flood,et al.  The Fortress Language Specification , 2007 .

[8]  Guy E. Blelloch,et al.  Provably efficient scheduling for languages with fine-grained parallelism , 1995, SPAA '95.

[9]  Charles E. Leiserson,et al.  The Cilk++ concurrency platform , 2009, 2009 46th ACM/IEEE Design Automation Conference.

[10]  Seth Copen,et al.  ENABLING PRIMITIVES FOR COMPILING PARALLEL LANGUAGES , 1995 .

[11]  Robert D. Blumofe,et al.  Executing multithreaded programs efficiently , 1995 .

[12]  Richard M. Karp,et al.  Randomized parallel algorithms for backtrack search and branch-and-bound computation , 1993, JACM.

[13]  Matteo Frigo,et al.  The implementation of the Cilk-5 multithreaded language , 1998, PLDI.

[14]  Victor Luchangco,et al.  The Fortress Language Specification Version 1.0 , 2007 .

[15]  D. B. Davis,et al.  Sun Microsystems Inc. , 1993 .

[16]  Bjarne Stroustrup,et al.  C++ Programming Language , 1986, IEEE Softw..

[17]  Rishiyur S. Nikhil,et al.  Cid: A Parallel, "Shared-Memory" C for Distributed-Memory Machines , 1994, LCPC.

[18]  Benjamin A. Dent,et al.  Burroughs' B6500/B7500 stack mechanism , 1968, AFIPS '68 (Spring).

[19]  James Reinders,et al.  Intel threading building blocks - outfitting C++ for multi-core processor parallelism , 2007 .

[20]  Matteo Frigo,et al.  Reducers and other Cilk++ hyperobjects , 2009, SPAA '09.

[21]  James R. Larus,et al.  Rewriting executable files to measure program behavior , 1994, Softw. Pract. Exp..

[22]  Bradley C. Kuszmaul,et al.  Cilk: an efficient multithreaded runtime system , 1995, PPOPP '95.

[23]  Burkhard Monien,et al.  Studying overheads in massively parallel MIN/MAX-tree evaluation , 1994, SPAA '94.

[24]  Devang Shah,et al.  Implementing Lightweight Threads , 1992, USENIX Summer.

[25]  Robert H. Halstead,et al.  Mul-T: a high-performance parallel Lisp , 1989, PLDI '89.

[26]  Bradley C. Kuszmaul,et al.  Synchronized MIMD computing , 1994 .

[27]  Doug Lea,et al.  A Java fork/join framework , 2000, JAVA '00.

[28]  Sebastian Burckhardt,et al.  The design of a task parallel library , 2009, OOPSLA.

[29]  Matteo Frigo,et al.  An analysis of dag-consistent distributed shared-memory algorithms , 1996, SPAA '96.

[30]  Eric S. Roberts,et al.  WorkCrews: An abstraction for controlling parallelism , 2005, International Journal of Parallel Programming.

[31]  Nicholas Nethercote,et al.  Valgrind: a framework for heavyweight dynamic binary instrumentation , 2007, PLDI '07.

[32]  C. Greg Plaxton,et al.  Thread Scheduling for Multiprogrammed Multiprocessors , 1998, SPAA '98.

[33]  Jim Sukha Brief announcement: a lower bound for depth-restricted work stealing , 2009, SPAA '09.

[34]  Udi Manber,et al.  DIB—a distributed implementation of backtracking , 1987, TOPL.

[35]  Vivek Sarkar,et al.  X10: an object-oriented approach to non-uniform cluster computing , 2005, OOPSLA '05.