Run-time methods for parallelizing partially parallel loops

We present a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that its iterations execute in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop. In addition, it can implement at run time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and element-wise reduction parallelization. We also describe a new scheme for constructing an optimal parallel execution schedule for the iterations of the loop.
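
To make the inspector/scheduler division concrete, the following is a minimal sketch of the general idea and not the algorithm of this paper. It assumes a loop of the form `a[w[i]] = a[r[i]] + 1` whose index arrays `r` and `w` are known only at run time. The names `build_stages`, `run_staged`, and `run_sequential` are hypothetical. The inspector here is sequential and performs neither privatization nor reduction parallelization; it only groups iterations into stages (wavefronts) so that no two iterations in the same stage depend on each other.

```python
def build_stages(reads, writes):
    """Inspector sketch: group iterations into stages (wavefronts).

    reads[i] / writes[i] list the array elements that iteration i
    reads / writes. Iterations in the same stage are independent.
    """
    last_write = {}  # element -> stage of the most recent writer
    last_read = {}   # element -> highest stage of any reader so far
    stages = []
    for i, (rd, wr) in enumerate(zip(reads, writes)):
        stage = 0
        for x in rd:  # flow dependence: read must follow the last write
            stage = max(stage, last_write.get(x, -1) + 1)
        for x in wr:  # output and anti dependences
            stage = max(stage,
                        last_write.get(x, -1) + 1,
                        last_read.get(x, -1) + 1)
        for x in rd:
            last_read[x] = max(last_read.get(x, -1), stage)
        for x in wr:
            last_write[x] = stage
        while len(stages) <= stage:
            stages.append([])
        stages[stage].append(i)
    return stages


def run_sequential(a, r, w):
    """Reference: the original loop, executed in iteration order."""
    a = list(a)
    for i in range(len(r)):
        a[w[i]] = a[r[i]] + 1
    return a


def run_staged(a, r, w, stages):
    """Scheduler sketch: execute stage by stage.

    Iterations inside a stage could run concurrently; here they run
    in reverse order to show that the order within a stage is free.
    """
    a = list(a)
    for stage in stages:
        for i in reversed(stage):
            a[w[i]] = a[r[i]] + 1
    return a


# Hypothetical access pattern, known only at run time.
r = [3, 0, 4, 1, 5, 2]
w = [0, 1, 2, 3, 4, 5]
a0 = [0, 10, 20, 30, 40, 50]

stages = build_stages([[x] for x in r], [[x] for x in w])
result = run_staged(a0, r, w, stages)
```

For this access pattern the six iterations collapse into three stages, and the staged execution produces the same array as the sequential loop. A real implementation would execute each stage with parallel workers and a barrier (or finer-grained synchronization) between stages.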
