On the implementation and effectiveness of autoscheduling for shared-memory multiprocessors

This thesis addresses the problem of implementing the autoscheduling model of computation on conventional shared-memory multiprocessors. In autoscheduling, the partitioning and scheduling of computations for parallel execution are performed by drive code that the compiler injects at the entry and exit points of each schedulable unit (task). A prototype autoscheduling compiler was implemented that generates code for both real and abstract multiprocessors. This thesis discusses the general organization of the compiler and of the code it generates, and describes the run-time library used by the executable autoscheduling code. Major implementation problems include the execution of the actual scheduling operations, the organization of the task queue, granularity control to adjust the level of parallelism exploited, cactus-stack support, parallel loop implementation, support for data distribution, and execution on a time-variant partition of physical processors. The correctness and performance of the autoscheduling code generated by the compiler were verified through measurements on a real multiprocessor, program-level execution-driven simulation, and instruction-level simulation. The results demonstrate the feasibility of an autoscheduling compiler and its ability to exploit new levels of parallelism on shared-memory multiprocessors.
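
The idea of drive code at task entry and exit can be illustrated with a minimal, single-threaded sketch. It is not the thesis's compiler output or run-time library: the names `Task`, `add_dependence`, `execute_with_drive_code`, and `run` are hypothetical, and the scheme assumes each task keeps a count of unfinished predecessors, with a single FIFO task queue. A real implementation would run the same entry and exit logic concurrently on several processors, with the queue and counters protected by synchronization.

```python
from collections import deque


class Task:
    """A schedulable unit plus the bookkeeping its drive code needs (hypothetical layout)."""

    def __init__(self, name, body):
        self.name = name
        self.body = body          # the user computation
        self.successors = []      # tasks that depend on this one
        self.pending = 0          # number of predecessors not yet finished


def add_dependence(pred, succ):
    """Record that succ may start only after pred has finished."""
    pred.successors.append(succ)
    succ.pending += 1


def execute_with_drive_code(task, ready, trace):
    # Entry block (assumed): note that the task has been dispatched.
    trace.append(task.name)

    task.body()

    # Exit block (assumed): release successors whose dependences are now satisfied
    # and place them on the task queue, so the program schedules itself.
    for succ in task.successors:
        succ.pending -= 1
        if succ.pending == 0:
            ready.append(succ)


def run(tasks):
    """Drain the task queue on one simulated processor; return the dispatch order."""
    ready = deque(t for t in tasks if t.pending == 0)
    trace = []
    while ready:
        execute_with_drive_code(ready.popleft(), ready, trace)
    return trace


# Diamond-shaped task graph: a -> b, a -> c, b -> d, c -> d.
results = []
a, b, c, d = (Task(n, lambda n=n: results.append(n.upper())) for n in "abcd")
add_dependence(a, b)
add_dependence(a, c)
add_dependence(b, d)
add_dependence(c, d)
order = run([a, b, c, d])
```

In this sketch no central scheduler decides what runs next: the exit block of each task enqueues whatever has become ready, which is the property the autoscheduling model relies on.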
