Automatic data layout for distributed-memory machines

The goal of languages like Fortran D or High Performance Fortran (HPF) is to provide a simple yet efficient machine-independent parallel programming model. After algorithm selection, the choice of data layout is the key intellectual challenge in writing an efficient program in such languages. The performance of a data layout depends on the target compilation system, the target machine, the problem size, and the number of available processors. This makes the choice of a good layout extremely difficult for most users of such languages. If languages such as HPF are to find general acceptance, the need for data layout selection support has to be addressed. We believe that the appropriate way to provide the needed support is through a tool that generates data layout specifications automatically. This article discusses the design and implementation of a data layout selection tool that generates HPF-style data layout specifications automatically. Because layout is done in a tool that is not embedded in the target compiler and hence will be run only a few times during the tuning phase of an application, it can use techniques such as integer programming that may be considered too computationally expensive for inclusion in production compilers. The proposed framework for automatic data layout selection builds and examines search spaces of candidate data layouts. A candidate layout is an efficient layout for some part of the program. After the generation of search spaces, a single candidate layout is selected for each program part, resulting in a data layout for the entire program. A good overall data layout may require the remapping of arrays between program parts. A performance estimator based on a compiler model, an execution model, and a machine model is needed to predict the execution time of each candidate layout and the costs of possible remappings between candidate data layouts.
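The selection step described above can be illustrated with a minimal sketch. It assumes that the performance estimator has already produced an execution-time estimate for every candidate layout of every program part and a remapping cost for every pair of candidates on consecutive parts. The function name `select_layouts`, the table shapes, and all cost values are hypothetical, and the sketch uses exhaustive enumeration in place of the integer programming solver used by the framework, so it is only practical for tiny examples.

```python
from itertools import product

def select_layouts(exec_cost, remap_cost, edges):
    """Pick one candidate layout per program part, minimizing estimated cost.

    exec_cost[p][c]          -- estimated time of part p under its candidate c
    remap_cost[(p, q)][c][d] -- estimated cost of remapping from candidate c
                                of part p to candidate d of part q
    edges                    -- (p, q) pairs of parts executed one after another
    All values are illustrative assumptions, not measured data.
    """
    parts = sorted(exec_cost)
    best_total, best_choice = float("inf"), None
    # Exhaustive enumeration: one candidate index per part.
    for combo in product(*(range(len(exec_cost[p])) for p in parts)):
        choice = dict(zip(parts, combo))
        total = sum(exec_cost[p][choice[p]] for p in parts)
        total += sum(remap_cost[(p, q)][choice[p]][choice[q]] for p, q in edges)
        if total < best_total:
            best_total, best_choice = total, choice
    return best_total, best_choice

# Two parts, each with candidate 0 (row layout) and candidate 1 (column layout).
exec_cost = {0: [10, 30], 1: [40, 12]}
edges = [(0, 1)]
cheap_remap = {(0, 1): [[0, 15], [15, 0]]}
costly_remap = {(0, 1): [[0, 25], [25, 0]]}

dynamic = select_layouts(exec_cost, cheap_remap, edges)
static = select_layouts(exec_cost, costly_remap, edges)
```

With the cheaper remapping table, the lowest-cost choice switches layouts between the two parts. With the more expensive table, keeping the column layout throughout becomes cheaper, which shows why remapping costs have to be part of the selection.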
In the proposed framework, instances of NP-complete problems are solved during the construction of candidate layout search spaces and the final selection of candidate layouts from each search space. Rather than resorting to heuristics, the framework capitalizes on state-of-the-art 0-1 integer programming technology to compute optimal solutions of these NP-complete problems. A prototype data layout assistant tool based on our framework has been implemented as part of the D system currently under development at Rice University. The article reports preliminary experimental results. The results indicate that the framework is efficient and allows the generation of data layouts of high quality.
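As an illustration only, the final selection step could be written as a 0-1 integer program along the following lines. This is one standard formulation given as an assumption, not necessarily the one used in the article. The symbols are introduced here for the sketch: $x_{p,c}$ is 1 if candidate $c$ is chosen for program part $p$, $y_{p,c,q,d}$ is 1 if arrays are remapped from candidate $c$ of part $p$ to candidate $d$ of part $q$, $t_{p,c}$ and $r_{p,c,q,d}$ are the estimated execution and remapping costs, and $E$ is the set of consecutive part pairs.

```latex
\begin{aligned}
\min \quad & \sum_{p} \sum_{c} t_{p,c}\, x_{p,c}
           \;+\; \sum_{(p,q) \in E} \sum_{c} \sum_{d} r_{p,c,q,d}\, y_{p,c,q,d} \\
\text{s.t.} \quad
  & \sum_{c} x_{p,c} = 1 && \text{for every part } p, \\
  & \sum_{d} y_{p,c,q,d} = x_{p,c} && \text{for every } (p,q) \in E \text{ and candidate } c \text{ of } p, \\
  & \sum_{c} y_{p,c,q,d} = x_{q,d} && \text{for every } (p,q) \in E \text{ and candidate } d \text{ of } q, \\
  & x_{p,c} \in \{0,1\}, \quad y_{p,c,q,d} \in \{0,1\}.
\end{aligned}
```

The first constraint selects exactly one candidate per part. The other two force a remapping variable to be 1 exactly when both of its endpoint candidates are selected, so the objective charges each remapping cost once.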
