The implementation of the Cilk-5 multithreaded language

The fifth release of the multithreaded language Cilk uses a provably good "work-stealing" scheduling algorithm similar to that of the first system, but the language has been completely redesigned and the runtime system completely reengineered. The efficiency of the new implementation was aided by a clear strategy that arose from a theoretical analysis of the scheduling algorithm: concentrate on minimizing overheads that contribute to the work, even at the expense of overheads that contribute to the critical path. Although it may seem counterintuitive to move overheads onto the critical path, this "work-first" principle has led to a portable Cilk-5 implementation in which the typical cost of spawning a parallel thread is only 2 to 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. This paper describes how the work-first principle was exploited in the design of Cilk-5's compiler and its runtime system. In particular, we present Cilk-5's novel "two-clone" compilation strategy and its Dijkstra-like mutual-exclusion protocol for implementing the ready deque in the work-stealing scheduler.
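To make the ready-deque idea concrete, the following is a minimal sketch of a Dijkstra-like protocol in the spirit of the one the abstract mentions. It is not the Cilk-5 runtime code. The names `deque_t`, `deque_push`, `deque_pop`, `deque_steal`, and `DEQUE_CAP` are hypothetical, frames are reduced to plain integers, and C11 atomics with a POSIX mutex are assumed in place of whatever the actual runtime uses. The point it illustrates is the work-first trade-off: the worker's common-case pop takes no lock, while thieves always pay for the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

#define DEQUE_CAP 64 /* hypothetical fixed capacity, for illustration only */

typedef struct {
    atomic_int head;      /* H: index of the oldest frame; advanced by thieves */
    atomic_int tail;      /* T: index of the first free slot; moved by the worker */
    pthread_mutex_t lock; /* L: held by thieves, and by the worker only on conflict */
    int frames[DEQUE_CAP];
} deque_t;

void deque_init(deque_t *d) {
    atomic_init(&d->head, 0);
    atomic_init(&d->tail, 0);
    pthread_mutex_init(&d->lock, NULL);
}

/* Worker only: publish a frame at the tail. No lock is taken. */
void deque_push(deque_t *d, int frame) {
    int t = atomic_load(&d->tail);
    assert(t < DEQUE_CAP);
    d->frames[t] = frame;
    atomic_store(&d->tail, t + 1);
}

/* Worker only: take the newest frame. The fast path is lock-free;
 * the lock is taken only when a thief may be racing for the last frame. */
bool deque_pop(deque_t *d, int *out) {
    int t = atomic_fetch_sub(&d->tail, 1) - 1;     /* T-- (optimistic claim) */
    if (atomic_load(&d->head) > t) {               /* possible conflict */
        atomic_fetch_add(&d->tail, 1);             /* T++ (back off) */
        pthread_mutex_lock(&d->lock);
        t = atomic_fetch_sub(&d->tail, 1) - 1;     /* retry under the lock */
        if (atomic_load(&d->head) > t) {
            atomic_fetch_add(&d->tail, 1);         /* frame is gone */
            pthread_mutex_unlock(&d->lock);
            return false;
        }
        pthread_mutex_unlock(&d->lock);
    }
    *out = d->frames[t];
    return true;
}

/* Thief: take the oldest frame. Thieves always serialize on the lock. */
bool deque_steal(deque_t *d, int *out) {
    pthread_mutex_lock(&d->lock);
    int h = atomic_fetch_add(&d->head, 1) + 1;     /* H++ (optimistic claim) */
    if (h > atomic_load(&d->tail)) {
        atomic_fetch_sub(&d->head, 1);             /* H-- (nothing to steal) */
        pthread_mutex_unlock(&d->lock);
        return false;
    }
    *out = d->frames[h - 1];
    pthread_mutex_unlock(&d->lock);
    return true;
}
```

In this sketch the thief-side lock and the worker's slow path are overheads on the critical path, while the worker's lock-free push and fast-path pop keep the overhead on the work small, which is the direction the work-first principle favors.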
