A framework for performance-based program partitioning

Most of the reported work in the parallelizing-compilers literature focuses on analyzing program characteristics such as dependencies, loop structures, and memory reference patterns to optimize the generated parallel code [3, 2, 7, 8, 14, 10]. Unfortunately, parallelizing compilers have little or no knowledge of the actual run-time behavior of the synthesized code, owing to the complex interaction of the underlying hardware and software subsystems. This interaction can significantly affect the performance of the generated code and must be considered during the program partitioning phases of the compiler. In this paper, we present an efficient and accurate performance-model-based program partitioning approach for parallel architectures. We introduce the concept of behavioral edges, which capture the interactions between computation and communication through parametric functions. We present an efficient algorithm that identifies behavioral edges, uses them to modify cost estimates, and adapts the schedule to improve schedule length. The program partitioning phase uses static estimates computed from the behavioral edges, and partitioning is performed iteratively over the ordered program dependence graph (PDG) based on the computed intervals. Our framework demonstrates significant performance improvement (a factor of 10 in many cases).
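To make the idea concrete, the following is a minimal illustrative sketch, not the paper's actual algorithm or interface: a behavioral edge is modeled as a parametric function that inflates a static communication cost according to the computation load it contends with, and the adjusted costs feed a simple critical-path estimate of schedule length. The linear contention model and all names (`behavioral_edge`, `critical_path`, the example graph) are assumptions made for this sketch.

```python
# Hypothetical sketch of behavioral-edge cost adjustment.
# A behavioral edge is a parametric function: the effective communication
# cost grows with the computation load active during the transfer.
# (A linear model is chosen purely for illustration.)

def behavioral_edge(base_comm_cost, contention_factor):
    """Return a cost function of the overlapping compute load."""
    return lambda compute_load: base_comm_cost * (1.0 + contention_factor * compute_load)

# Task graph: node -> (compute cost, {successor: behavioral cost function})
graph = {
    "a": (4.0, {"b": behavioral_edge(2.0, 0.25), "c": behavioral_edge(3.0, 0.10)}),
    "b": (5.0, {"d": behavioral_edge(1.0, 0.50)}),
    "c": (2.0, {"d": behavioral_edge(2.0, 0.20)}),
    "d": (3.0, {}),
}

def critical_path(graph, node, memo=None):
    """Longest compute+communication path from `node` to any sink,
    evaluating each behavioral edge at the successor's compute load."""
    memo = {} if memo is None else memo
    if node in memo:
        return memo[node]
    compute, succs = graph[node]
    best = 0.0
    for succ, edge_fn in succs.items():
        succ_load = graph[succ][0]  # load the transfer contends with
        best = max(best, edge_fn(succ_load) + critical_path(graph, succ, memo))
    memo[node] = compute + best
    return memo[node]

length = critical_path(graph, "a")
print(f"estimated schedule length: {length:.2f}")  # 19.00 for this graph
```

A partitioner in the spirit of the paper would recompute such adjusted estimates after each placement decision and iterate until the estimated schedule length stops improving.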

[1] Barton P. Miller, et al. The Paradyn Parallel Performance Measurement Tool, 1995, Computer.

[2] Dharma P. Agrawal, et al. Optimal Scheduling Algorithm for Distributed-Memory Machines, 1998, IEEE Trans. Parallel Distributed Syst.

[3] Milind Girkar, et al. Automatic Extraction of Functional Parallelism from Ordinary Programs, 1992, IEEE Trans. Parallel Distributed Syst.

[4] Dharma P. Agrawal, et al. On Control Flow and Pseudo-Static Dynamic Allocation Strategy, 1990, ICPP.

[5] Santosh Pande. A compile time partitioning method for DOALL loops on distributed memory systems, 1996, Proceedings of the 1996 ICPP Workshop on Challenges for Parallel Processing.

[6] Thomas Fahringer. Estimating and Optimizing Performance for Parallel Programs, 1995, Computer.

[7] S. Darbha, et al. Effect of variation in compile time costs on scheduling tasks on distributed memory systems, 1996, Proceedings of 6th Symposium on the Frontiers of Massively Parallel Computation (Frontiers '96).

[8] Charles Koelbel, et al. Compiling Global Name-Space Parallel Loops for Distributed Execution, 1991, IEEE Trans. Parallel Distributed Syst.

[9] Monica S. Lam, et al. Global optimizations for parallelism and locality on scalable parallel machines, 1993, PLDI '93.

[10] Ishfaq Ahmad, et al. Dynamic Critical-Path Scheduling: An Effective Technique for Allocating Task Graphs to Multiprocessors, 1996, IEEE Trans. Parallel Distributed Syst.

[11] Ken Kennedy, et al. Requirements for Data-Parallel Programming Environments, 1994, IEEE Parallel & Distributed Technology: Systems & Applications.

[12] Geoffrey C. Fox, et al. Compiling Fortran 90D/HPF for Distributed Memory MIMD Computers, 1994, J. Parallel Distributed Comput.

[13] Dharma P. Agrawal, et al. A Scalable Scheduling Scheme for Functional Parallelism on Distributed Memory Multiprocessor Systems, 1995, IEEE Trans. Parallel Distributed Syst.

[14] Vivek Sarkar, et al. Partitioning and Scheduling Parallel Programs for Multiprocessing, 1989.

[15] P. Sadayappan, et al. An approach to communication-efficient data redistribution, 1994, ICS '94.

[16] Ken Kennedy, et al. Compiling Fortran D for MIMD distributed-memory machines, 1992, CACM.

[17] Manish Gupta, et al. Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers, 1992, IEEE Trans. Parallel Distributed Syst.

[18] Michael A. Harrison, et al. Accurate static estimators for program optimization, 1994, PLDI '94.

[19] Rudolf Eigenmann, et al. Symbolic range propagation, 1995, Proceedings of 9th International Parallel Processing Symposium.

[20] Dharma P. Agrawal, et al. Run-time issues in program partitioning on distributed memory systems, 1995, Concurr. Pract. Exp.

[21] Santosh Pande, et al. Program Repartitioning on Varying Communication Cost Parallel Architectures, 1996, J. Parallel Distributed Comput.

[22] Ken Kennedy, et al. A static performance estimator to guide data partitioning decisions, 1991, PPOPP '91.

[23] Andrew A. Chien, et al. Software overhead in messaging layers: where does the time go?, 1994, ASPLOS VI.

[24] Tao Yang, et al. On the Granularity and Clustering of Directed Acyclic Task Graphs, 1993, IEEE Trans. Parallel Distributed Syst.

[25] V. Sarkar, et al. Automatic partitioning of a program dependence graph into parallel tasks, 1991, IBM J. Res. Dev.

[26] James R. Larus, et al. The Wisconsin Wind Tunnel: virtual prototyping of parallel computers, 1993, SIGMETRICS '93.

[27] Jaspal Subhlok, et al. Optimal mapping of sequences of data parallel tasks, 1995, PPOPP '95.

[28] Utpal Banerjee. Loop Parallelization, 1994, Springer US.

[29] Anant Agarwal, et al. Automatic Partitioning of Parallel Loops and Data Arrays for Distributed Shared-Memory Multiprocessors, 1995, IEEE Trans. Parallel Distributed Syst.

[30] James R. Larus, et al. Where is time spent in message-passing and shared-memory programs?, 1994, ASPLOS VI.

[31] Anne Rogers, et al. Process decomposition through locality of reference, 1989, PLDI '89.

[32] Rice University. High Performance Fortran language specification, 1993.

[33] D. A. Reed, et al. Scalable performance analysis: the Pablo performance analysis environment, 1993, Proceedings of Scalable Parallel Libraries Conference.

[34] Ko-Yang Wang. Precise compile-time performance prediction for superscalar-based computers, 1994, PLDI '94.

[35] David W. Wall. Predicting program behavior using real or estimated profiles, 1991, PLDI '91.