Space-Efficient Scheduling of Multithreaded Computations

This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than a single-processor execution requires. Using a new graph-theoretic model of multithreaded computation, the paper quantifies execution efficiency by three measures: $T_1$ is the time required to execute the computation on one processor, $T_\infty$ is the time required with an infinite number of processors, and $S_1$ is the space required to execute the computation on one processor. A computation executed on $P$ processors is time-efficient if its execution time is $O(T_1/P + T_\infty)$, which means it achieves linear speedup when $P = O(T_1/T_\infty)$, and it is space-efficient if it uses $O(S_1 P)$ total space, which means the space per processor is within a constant factor of that required for a one-processor execution. The first result derived from this model shows that there exist multithreaded computations for which no execution schedule can be simultaneously time-efficient and space-efficient. Restricting attention to "strict" computations, those in which all arguments to a procedure must be available before the procedure can be invoked, yields much more positive results. Specifically, for any strict multithreaded computation, a simple online algorithm can compute a schedule that is both time-efficient and space-efficient. Unfortunately, because the algorithm uses a global queue, the overhead of computing the schedule can be substantial. This problem is overcome by a decentralized algorithm that can compute and execute a $P$-processor schedule online in expected time $O(T_1/P + T_\infty \lg P)$ and worst-case space $O(S_1 P \lg P)$, including overhead costs.
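The time measures above can be made concrete with a small simulation. The sketch below is not the paper's algorithm: it is a plain greedy (list) scheduler over a hypothetical DAG of unit-time tasks, with a single shared ready queue, and it only illustrates how $T_1$, $T_\infty$, and the $T_1/P + T_\infty$ bound relate. The function name `greedy_schedule_length` and the example graph are assumptions made for illustration, and space usage ($S_1$) is not modeled at all.

```python
from collections import deque


def greedy_schedule_length(dag, p):
    """Return the number of time steps a greedy schedule needs on p processors.

    dag maps each unit-time task to the list of tasks that depend on it.
    At every step, up to p ready tasks run; tasks enabled during a step
    become ready only for later steps.
    """
    indegree = {task: 0 for task in dag}
    for successors in dag.values():
        for succ in successors:
            indegree[succ] += 1

    ready = deque(task for task, deg in indegree.items() if deg == 0)
    steps = 0
    while ready:
        # Pick the batch first, so newly enabled tasks wait for the next step.
        batch = [ready.popleft() for _ in range(min(p, len(ready)))]
        for task in batch:
            for succ in dag[task]:
                indegree[succ] -= 1
                if indegree[succ] == 0:
                    ready.append(succ)
        steps += 1
    return steps


# Hypothetical example: one source task forks 8 independent tasks,
# which all join into one sink task.
middle = [f"m{i}" for i in range(8)]
dag = {"source": middle, "sink": []}
dag.update({m: ["sink"] for m in middle})

work = len(dag)                                # T_1: total unit-time tasks
span = greedy_schedule_length(dag, len(dag))   # T_inf: longest dependency chain

for p in (1, 2, 4, 8):
    t_p = greedy_schedule_length(dag, p)
    # Classical greedy-scheduling bound for unit-time tasks.
    assert t_p <= work / p + span
```

Once $P$ grows past roughly $T_1/T_\infty$, the $T_\infty$ term dominates and extra processors stop helping, which is the sense in which linear speedup holds only for $P = O(T_1/T_\infty)$.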
