A Compiler Optimization Algorithm for Shared-Memory Multiprocessors

This paper presents a new compiler optimization algorithm that parallelizes applications for symmetric, shared-memory multiprocessors. The algorithm considers data locality, parallelism, and the granularity of parallelism. It uses dependence analysis and a simple cache model to drive its optimizations, and it optimizes across procedures using interprocedural analysis and transformations. We validate the algorithm by hand-applying it to sequential versions of parallel Fortran programs that operate over dense matrices. The programs were originally hand-coded to target a variety of parallel machines using loop parallelism. We ignore the user's parallel loop directives and use known and implemented dependence and interprocedural analysis to find parallelism. We then apply our new optimization algorithm to the resulting program. We compare the original parallel program to the hand-optimized program and show that, on a shared-memory, bus-based parallel machine with local caches, our algorithm improves three programs, matches four programs, and degrades one program in our test suite. This experiment suggests that existing dependence and interprocedural array analysis can automatically detect user parallelism, and it demonstrates that user-parallelized codes often benefit from our compiler optimizations. These results provide evidence that we need both parallel algorithms and compiler optimizations to effectively utilize parallel machines.
