An introduction to compilation issues for parallel machines

The exploitation of today's high-performance computer systems requires the effective use of parallelism in many forms and at numerous levels. This survey article discusses program analysis and restructuring techniques that target parallel architectures. We first describe various categories of architectures that are oriented toward parallel computation models: vector architectures, shared-memory multiprocessors, massively parallel machines, message-passing architectures, VLIWs, and multithreaded architectures. We then describe a variety of optimization techniques that can be applied to sequential programs to effectively utilize the vector and parallel processing units. After an overview of basic dependence analysis, we present restructuring transformations on DO loops targeted both to vectorization and to concurrent execution, interprocedural and pointer analysis, task scheduling, instruction-level parallelization, and compiler-assisted data placement. We conclude that although tremendous advances have been made in dependence theory and in the development of a “toolkit” of transformations, parallel systems are used most effectively when the programmer takes part in the optimization process.
