Compiler optimizations for parallel loops with fine-grained synchronization

Loops in scientific and engineering applications are a rich source of parallelism. To obtain a higher degree of parallelism, loops with loop-carried dependences, which are largely serialized by traditional techniques, must be parallelized with fine-grained synchronization. This approach, known as DOACROSS parallelization, requires new optimization strategies that preserve parallelism while minimizing inter-processor communication. In this thesis, I closely examine the issues involved in DOACROSS parallelization. The work has two focuses: (1) increasing parallelism, and (2) reducing communication overhead. Strategies for four major optimization problems are proposed and described in detail. For loops with uniform dependences (i.e., with constant dependence distance vectors), our statement re-ordering and redundant synchronization elimination enhance the overlap between iterations while reducing the amount of synchronization. For loops with non-uniform dependences, a new dependence uniformization scheme composes a small set of uniform dependences with a small dependence cone; these uniform dependences preserve all non-uniform dependences, and a small dependence cone implies a higher speedup and lower communication overhead. Our last loop transformation targets loops whose dependences are unknown at compile time: a runtime parallelization technique with good locality and high concurrency. Its performance is fairly consistent because its runtime analysis, which is usually the performance bottleneck of runtime schemes, requires less global communication. We provide performance measurements and comparisons with previously proposed schemes. The results indicate that our schemes outperform earlier ones, achieving higher parallelism and lower communication requirements. These schemes form an integral part of future high-performance parallelizing compilers.
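To make the DOACROSS idea concrete, the following is a minimal sketch of a loop with a loop-carried dependence of constant distance, executed by multiple workers with per-iteration post/wait synchronization. It is an illustration only, not the thesis's implementation: the Python threading model, the names `N`, `DIST`, `P`, and the recurrence `a[i] = a[i - DIST] + 1` are all assumptions chosen for clarity. Each iteration waits for the iteration it depends on, runs its body, then posts so that the iteration at distance `DIST` ahead may proceed; iterations separated by less than `DIST` overlap freely.

```python
import threading

N = 8        # number of loop iterations
DIST = 2     # constant dependence distance: iteration i depends on i - DIST
P = 4        # number of worker threads (hypothetical machine size)

a = [1] * DIST + [0] * N                 # DIST initial values, then N results
done = [threading.Event() for _ in range(N + DIST)]   # done[j]: a[j] is ready
for e in done[:DIST]:
    e.set()                              # values before the loop are ready

def worker(p):
    # cyclic (wrap-around) iteration assignment: thread p runs i = p, p+P, ...
    for i in range(p, N, P):
        done[i].wait()                   # wait: consume the dependence on a[i]
        a[i + DIST] = a[i] + 1           # loop body with loop-carried dependence
        done[i + DIST].set()             # post: release the iteration at i + DIST

threads = [threading.Thread(target=worker, args=(p,)) for p in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With `DIST = 2`, iterations `i` and `i + 1` never wait on each other, so up to `DIST` iterations execute concurrently; redundant-synchronization elimination and statement re-ordering, as described above, aim to reduce how many such post/wait pairs must actually be executed.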
