Parallelizing Compiler Techniques Based on Linear Inequalities

Shared-memory multiprocessors, built out of the latest microprocessors, are becoming a widely available class of computationally powerful machines. These affordable multiprocessors can potentially deliver supercomputer-like performance to the general public. To harness the power of these machines effectively, it is important to find all the available parallelism in programs. The Stanford SUIF interprocedural parallelizer we have developed can detect coarser-grained parallelism in sequential scientific applications than was previously possible. Specifically, it can parallelize loops that span numerous procedures and hundreds of lines of code, frequently requiring modifications to array data structures, such as array privatization. Measurements from several standard benchmark suites demonstrate that aggressive interprocedural analyses can substantially advance the capability of automatic parallelization technology. However, locating parallelism is not sufficient to achieve high performance; it is also critical to make effective use of the memory hierarchy. In parallel applications, false sharing and cache conflicts between processors can significantly reduce performance. We have developed the first compiler that automatically performs a full suite of data transformations (a combination of transposing, strip-mining and padding). The performance of many benchmarks improves drastically after the data transformations. We introduce a framework based on systems of linear inequalities for developing compiler algorithms. Many of the whole-program analyses and aggressive optimizations in our compiler employ this framework. Using this framework, general solutions to many compiler problems can be found systematically.
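To make the idea of a linear-inequality framework concrete, the following is a minimal sketch of how one compiler question, whether two array accesses in a loop can ever touch the same element, might be posed as a feasibility test on a system of inequalities. It uses Fourier-Motzkin elimination over the rationals, which is one standard technique and not necessarily the algorithm used in the SUIF compiler. The loop being analyzed, the function names, and the constraint encoding are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def eliminate(rows, k):
    """Project variable k out of a system of rows (coeffs, bound),
    where each row means sum(coeffs[i] * x[i]) <= bound."""
    lower, upper, rest = [], [], []
    for coeffs, bound in rows:
        if coeffs[k] > 0:
            upper.append((coeffs, bound))   # gives an upper bound on x[k]
        elif coeffs[k] < 0:
            lower.append((coeffs, bound))   # gives a lower bound on x[k]
        else:
            rest.append((coeffs, bound))
    for (lc, lb), (uc, ub) in product(lower, upper):
        a, b = -lc[k], uc[k]                # both positive, so x[k] cancels
        combined = [a * u + b * l for u, l in zip(uc, lc)]
        rest.append((combined, a * ub + b * lb))
    return rest


def feasible_over_rationals(rows, nvars):
    """Return True if the system has a rational solution."""
    rows = [([Fraction(c) for c in coeffs], Fraction(bound))
            for coeffs, bound in rows]
    for k in range(nvars):
        rows = eliminate(rows, k)
    # Every remaining row has all-zero coefficients and reads 0 <= bound.
    return all(bound >= 0 for _, bound in rows)


def may_depend(n):
    """Hypothetical loop: for i in 1..n: A[i] = A[i + 10].
    Ask whether a write at iteration w and a read at iteration r
    can refer to the same element, i.e. w == r + 10."""
    rows = [
        ([-1, 0], -1),   # w >= 1
        ([1, 0], n),     # w <= n
        ([0, -1], -1),   # r >= 1
        ([0, 1], n),     # r <= n
        ([1, -1], 10),   # w - r <= 10
        ([-1, 1], -10),  # w - r >= 10
    ]
    return feasible_over_rationals(rows, nvars=2)


if __name__ == "__main__":
    print(may_depend(10))  # no iteration pair satisfies the constraints
    print(may_depend(20))  # some pair does, so the loop may carry a dependence
```

The test is conservative: an infeasible system proves that no integer solution exists, so the iterations are independent, while a feasible rational system only means a dependence cannot be ruled out by this check alone.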
