Provably efficient scheduling for languages with fine-grained parallelism

Many high-level parallel programming languages allow for fine-grained parallelism. As in the popular work-time framework for parallel algorithm design, programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. A common concern in executing such programs is to schedule tasks to processors dynamically so as to minimize not only the execution time, but also the amount of space (memory) needed. Without careful scheduling, the parallel execution on $p$ processors can use a factor of $p$ or more space than a sequential implementation of the same program. This paper first identifies a class of parallel schedules that are provably efficient in both time and space. For any computation with $w$ units of work and critical path length $d$, and for any sequential schedule that takes space $s_1$, we provide a parallel schedule that takes fewer than $w/p + d$ steps on $p$ processors and requires less than $s_1 + p \cdot d$ space. This matches the lower bound that we show, and significantly improves upon the best previous bound of $s_1 \cdot p$ space for the common case where $d \ll s_1$. The paper then describes a scheduler for implementing high-level languages with nested parallelism that generates schedules in this class. During program execution, as the structure of the computation is revealed, the scheduler keeps track of the active tasks, allocates the tasks to the processors, and performs the necessary task synchronization. The scheduler is itself a parallel algorithm, and incurs at most a constant factor overhead in time and space, even when the scheduling granularity is individual units of work.
The algorithm is the first efficient solution to the scheduling problem discussed here, even if space considerations are ignored.
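
To make the time bound concrete, the following is a minimal sketch of a greedy schedule for unit-time tasks, in which ready tasks are prioritized by their position in a given sequential schedule. It is an illustration under stated assumptions, not the scheduler described in the paper: the function names, the dictionary-based DAG encoding, and the small fork-join example are all hypothetical, and the sketch tracks only the step count, not space.

```python
from typing import Dict, List


def greedy_schedule(deps: Dict[str, List[str]], order: List[str], p: int) -> List[List[str]]:
    """Schedule unit-time tasks on p processors; returns one task list per step.

    deps maps a task to the tasks it must wait for (hypothetical encoding).
    order is a sequential schedule, used here as the priority list.
    """
    rank = {task: i for i, task in enumerate(order)}
    done = set()
    remaining = set(order)
    steps: List[List[str]] = []
    while remaining:
        # A task is ready once all of its dependencies have completed.
        ready = [t for t in remaining if all(d in done for d in deps.get(t, []))]
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        # Greedy: run as many ready tasks as possible, earliest in `order` first.
        ready.sort(key=rank.__getitem__)
        batch = ready[:p]
        steps.append(batch)
        done.update(batch)
        remaining.difference_update(batch)
    return steps


def critical_path_length(deps: Dict[str, List[str]], order: List[str]) -> int:
    """Number of tasks on the longest dependency chain (order must be topological)."""
    depth: Dict[str, int] = {}
    for t in order:
        depth[t] = 1 + max((depth[d] for d in deps.get(t, [])), default=0)
    return max(depth.values())


# Hypothetical fork-join computation: root forks four tasks, which then join.
deps = {
    "a1": ["root"], "a2": ["root"], "a3": ["root"], "a4": ["root"],
    "join": ["a1", "a2", "a3", "a4"],
}
order = ["root", "a1", "a2", "a3", "a4", "join"]
p = 2

schedule = greedy_schedule(deps, order, p)
work = len(order)                          # w = 6 unit tasks
depth = critical_path_length(deps, order)  # d = 3
print(len(schedule), "steps; bound w/p + d =", work / p + depth)
```

In this example the schedule runs root alone, then two pairs of forked tasks, then the join, so the step count stays below $w/p + d$. Any greedy schedule of unit-time tasks satisfies that time bound; the space bound stated in the abstract depends on the particular class of schedules the paper identifies and is not checked here.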
