The Cilk system for parallel multithreaded computing

Although cost-effective parallel machines are now commercially available, the widespread use of parallel processing is still held back, mainly by the difficulty of parallel programming. In particular, it is still difficult to build efficient implementations of parallel applications whose communication patterns are either highly irregular or dependent upon dynamic information. Multithreading has become an increasingly popular way to implement these dynamic, asynchronous, concurrent programs. Cilk (pronounced "silk") is our C-based multithreaded computing system that provides provably good performance guarantees. This thesis describes the evolution of the Cilk language and runtime system, as well as the applications that shaped that evolution.

Using Cilk, programmers can express their applications either by writing multithreaded code in a continuation-passing style or by writing code with normal call/return semantics and specifying which calls can be performed in parallel. The Cilk runtime system takes complete control of the scheduling, load balancing, and communication needed to execute the program, thereby insulating the programmer from these details. The programmer can rest assured that the program will be executed efficiently, since the Cilk scheduler provably achieves time, space, and communication bounds all within a constant factor of optimal.

For distributed-memory environments, we have implemented a software shared-memory system for Cilk. We have defined "dag consistency," a lock-free memory consistency model well suited to the needs of a multithreaded program. Because dag consistency is a weak consistency model, we have been able to implement coherence efficiently in software.

The $\star$Socrates computer chess program is the most complex application written in Cilk. $\star$Socrates is a large, nondeterministic, challenging application whose complex control dependencies make it inexpressible in many other parallel programming systems. Running on an 1824-node Paragon, $\star$Socrates finished second in the 1995 World Computer Chess Championship.

Currently, versions of Cilk run on the Thinking Machines CM-5, the Intel Paragon, various SMPs, and networks of workstations. The same Cilk program will run on all of these platforms with little, if any, modification. Applications written in Cilk include protein folding, graphic rendering, backtrack search, and computer chess.

(Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)
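
The call/return style mentioned in the abstract, in which the programmer writes ordinary recursive code and only marks which calls may run in parallel, can be illustrated with a small sketch. This is not Cilk syntax and not the Cilk runtime: `spawn` and the `sync` handle it returns are hypothetical helpers written here for illustration, and each spawn simply starts one Python thread, whereas Cilk's scheduler balances the work across processors. The sketch only shows the program structure; it makes no claim about performance.

```python
import threading


def spawn(fn, *args):
    """Hypothetical helper: run fn(*args) in a new thread.

    Returns a sync() function that waits for the child and yields its result.
    """
    box = {}

    def run():
        box["value"] = fn(*args)

    child = threading.Thread(target=run)
    child.start()

    def sync():
        child.join()          # wait until the spawned call has returned
        return box["value"]

    return sync


def fib(n):
    """Fibonacci written with call/return semantics plus one parallel call."""
    if n < 2:
        return n
    x = spawn(fib, n - 1)     # this call may proceed in parallel with the next
    y = fib(n - 2)            # ordinary call executed by the parent
    return x() + y            # sync with the child before combining results


if __name__ == "__main__":
    print(fib(10))
```

Removing the `spawn` wrapper and calling `fib(n - 1)` directly leaves a correct serial program, which is the property that makes this style easy to reason about.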
