Cilk: efficient multithreaded computing

This thesis describes Cilk, a parallel multithreaded language for programming contemporary shared-memory multiprocessors (SMPs). Cilk is a simple extension of C which provides constructs for parallel control and synchronization. Cilk imposes very low overheads: the typical cost of spawning a parallel thread is only between 2 and 6 times the cost of a C function call on a variety of contemporary machines. Many Cilk programs run on one processor with virtually no degradation compared to equivalent C programs. We present the "work-first" principle, which guided the design of Cilk's scheduler, and two consequences of this principle: a novel "two-clone" compilation strategy and a Dijkstra-like mutual-exclusion protocol for implementing the ready queue in the work-stealing scheduler. To facilitate debugging of Cilk programs, Cilk provides a tool called the Nondeterminator-2, which finds nondeterministic bugs called "data races". We present two algorithms, All-Sets and Brelly, used by the Nondeterminator-2 for finding data races. The All-Sets algorithm is exact but can sometimes have poor performance; the Brelly algorithm, by imposing a locking discipline on the programmer, is guaranteed to run in nearly linear time. For a program that runs serially in time $T$, accesses $V$ shared memory locations, and holds at most $k$ locks simultaneously, Brelly runs in $O(kT\,\alpha(V,V))$ time and $O(kV)$ space, where $\alpha$ is Tarjan's functional inverse of Ackermann's function. Cilk can be run on clusters of SMPs as well. We define a novel weak memory model called "dag consistency", which provides a natural consistency model for use with multithreaded languages like Cilk. We provide evidence that Backer, the protocol that implements dag consistency, is both empirically and theoretically efficient. In particular, we prove bounds on running time and communication for a Cilk program executed on top of Backer, including all costs associated with Backer itself. We believe this proof is the first of its kind in this regard. Finally, we present the MultiBacker protocol for clusters of SMPs, which extends Backer to take advantage of hardware support for shared memory within an SMP.

Thesis Supervisor: Charles E. Leiserson
Title: Professor of Computer Science and Engineering
