A scalable method for run-time loop parallelization

Current parallelizing compilers do a reasonable job of extracting parallelism from programs with regular, well-behaved, statically analyzable access patterns. However, they cannot extract a significant fraction of the available parallelism if the program has a complex and/or statically insufficiently defined access pattern, e.g., simulation programs with irregular domains and/or dynamically changing interactions. Since such programs represent a large fraction of all applications, techniques are needed for extracting their inherent parallelism at run-time. In this paper we give a new run-time technique for finding an optimal parallel execution schedule for a partially parallel loop, i.e., a loop whose parallelization requires synchronization to ensure that the iterations are executed in the correct order. Given the original loop, the compiler generates inspector code that performs run-time preprocessing of the loop's access pattern, and scheduler code that schedules (and executes) the loop iterations. The inspector is fully parallel, uses no synchronization, and can be applied to any loop (from which an inspector can be extracted). In addition, it can implement at run-time the two most effective transformations for increasing the amount of parallelism in a loop: array privatization and reduction parallelization (element-wise). The ability to identify privatizable and reduction variables is very powerful since it eliminates the data dependences involving these variables and thus increases the amount of parallelism that can be extracted from the loop.
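To make the inspector/scheduler division concrete, the sketch below is a minimal serial C version of the general idea, for a loop whose iteration i reads A[rd[i]] and writes A[wr[i]] through index arrays unknown at compile time. All names, the loop body, and the data are hypothetical illustrations, not the paper's implementation; in particular, this toy inspector is serial and conservatively serializes any two iterations that touch the same element, whereas the paper's inspector is fully parallel and additionally handles privatization and reduction recognition.

    /* Inspector/scheduler sketch (hypothetical data and loop body). */
    #include <stdio.h>

    #define N 8   /* number of loop iterations (assumed)  */
    #define M 6   /* number of elements of A (assumed)    */

    static int    rd[N] = {0, 1, 1, 3, 0, 4, 2, 5};  /* read indices  */
    static int    wr[N] = {1, 2, 0, 3, 4, 5, 2, 0};  /* write indices */
    static double A[M]  = {1, 2, 3, 4, 5, 6};

    int main(void) {
        int wf[N];    /* wavefront (schedule level) of each iteration */
        int last[M];  /* highest wavefront so far touching each element */
        int levels = 0;

        for (int j = 0; j < M; j++) last[j] = 0;

        /* Inspector: traverse only the access pattern (no loop body)
           and place each iteration in the earliest wavefront that
           follows every earlier iteration sharing one of its elements. */
        for (int i = 0; i < N; i++) {
            int w = last[rd[i]] > last[wr[i]] ? last[rd[i]] : last[wr[i]];
            wf[i] = w + 1;
            last[rd[i]] = wf[i];
            last[wr[i]] = wf[i];
            if (wf[i] > levels) levels = wf[i];
        }

        /* Scheduler/executor: run the wavefronts in order; iterations
           inside one wavefront share no elements and are independent. */
        for (int l = 1; l <= levels; l++)
            for (int i = 0; i < N; i++)
                if (wf[i] == l)
                    A[wr[i]] += A[rd[i]];  /* hypothetical loop body */

        for (int j = 0; j < M; j++) printf("A[%d] = %g\n", j, A[j]);
        return 0;
    }

Because iterations assigned to the same wavefront are mutually independent, each inner loop of the executor could be run as a parallel loop with a barrier between wavefronts; that barrier is precisely the synchronization a partially parallel loop requires.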
