Space-efficient implementation of nested parallelism

Many of today's high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in a program, which is typically of a much higher degree than the number of processors. Hence, an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides keeping overheads low and balancing load well, the scheduling algorithm must minimize the space usage of the parallel program. This paper presents a scheduling algorithm that is provably space-efficient and time-efficient for nested parallel languages. In addition to proving space and time bounds for the parallel schedule generated by the algorithm, we demonstrate that it is efficient in practice. We have implemented a runtime system that uses our algorithm to schedule parallel threads. The results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.
