On the Quest for Perfect Load Balance in Loop-Based Parallel Computations

Loop structures are a potentially rich source of parallelism in programs written in a high-level programming language, such as Fortran. Parallelisation of loop structures, by assigning different loop iterations to the processors of a parallel computer and executing them concurrently, may lead to dramatic improvements in performance. Parallelising compilers aim to exploit this potential by converting a sequential program into a semantically equivalent parallel form, by means of a sequence of appropriately selected transformations. This requires mapping schemes that distribute the computational work embodied in the parallel loop across the multiple processors as evenly as possible. Ideally, each processor is assigned exactly the same amount of computational work, in which case perfect load balance is achieved; otherwise, some load imbalance is said to exist. This thesis investigates the extent to which perfect load balance can be attained when parallelising members of the class of loop nests whose bounds are either constants or linear expressions involving the indices of the surrounding loops. First, an algorithm for counting the number of iterations of a given loop nest is developed. This is capable of handling symbolic variables; that is, variables whose values are not known at compile time. The resulting, possibly symbolic, count can be used to provide estimates for the execution time of the loop nest. Using this algorithm as a basis for the quantitative evaluation of load imbalance, the main body of the thesis develops a compile-time load balancing strategy for mapping members of this class of loop nests. This strategy associates an appropriate mapping scheme with each loop nest, depending on the amount of computational work contained within it. At the heart of the strategy, a connection is established with an old problem in number theory, the Prouhet-Tarry-Escott problem.
Finally, a comparative analysis of related mapping schemes is conducted. Experimental results on a virtual shared memory parallel computer, the KSR1, show that, in many circumstances, the strategy proposed in this thesis achieves better performance.

Declaration

No portion of the work referred to in this thesis has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

1. Copyright in text of this thesis rests with the Author. Copies (by any process) either in full, or of extracts, may be made only in accordance with instructions given by the Author and lodged in the John Rylands University Library of Manchester. Details may be obtained from the Librarian. This page must form part of any such copies made. Further copies (by any process) of copies made in accordance with such instructions may not be made without the permission (in writing) of the Author.

2. The ownership of any intellectual property rights which may be described in this thesis is vested in the University of Manchester, subject to any prior agreement to the contrary, and may not be made available for use by third parties without the written permission of the University, which will prescribe the terms and conditions of any such agreement. Further information on the conditions under which disclosures and exploitation may take place is available from the Head of Department of Computer Science.

Στους γονείς μου, Ιωάννη και Ελένη
To my parents, Ioannis and Eleni

Acknowledgements

This thesis could not have been completed without the help of a number of people who made it possible; it is a pleasure to acknowledge them. My supervisor, Professor John Gurd, has been an invaluable source of continuous support, advice and encouragement. I am particularly grateful to him for his patience and his rigorous attention when writing this thesis.
This work owes much to all past and present members of the Centre for Novel Computing, who were always prepared to discuss and provide answers to my questions. In particular, I would like to thank my office mates in room 2.126 of the Department of Computer Science. Mike O'Boyle, Gholam Hedayat, Zbigniew Chamski, and the numerous discussions I had with them in the last four years, contributed significantly to the ideas expressed in this thesis; I am grateful to Mike especially for his help, in various ways, during the last months this thesis was being written. Henry Okora Okoyo and Armando Fortuna contributed to the creation of a stimulating environment in which it was a pleasure to work. Special thanks are also due to Elena Stöhr for constructive comments on an earlier draft of this thesis.

During the years that research for this thesis was being undertaken, as well as in the years that led to this stage, there were many people who helped, in their own way, and to whom I am grateful. However, there are two persons whose sacrifices have been by far unparalleled; these are my parents, who, alongside my sister, have always been an inexhaustible source of support. This thesis is dedicated to them.

Finally, I am indebted to the State Scholarships Foundation of Greece (Ίδρυμα Κρατικών Υποτροφιών, Ι.Κ.Υ.) for providing financial support.

This thesis was set using the LaTeX document preparation system. The use of Mathematica for carrying out several of the computations presented in the text is also acknowledged.

πολλὰ τὰ δεινὰ κοὐδὲν ἀνθρώπου δεινότερον πέλει·
Σοφοκλῆς, Ἀντιγόνη

many wonders there be, but naught more wondrous than man.
Sophocles, Antigone

Notation

The notation used in the thesis is rather standard; for easy reference, the symbols used are listed below, along with a short explanation:

⌊x⌋    the greatest integer less than or equal to x.
⌈x⌉    the least integer greater than or equal to x.
m | n    m divides n, i.e. there exists an integer k such that n = mk, where m and n are integers.
m ∤ n    m does not divide n, i.e. there exists no integer k such that n = mk, where m and n are integers.
gcd(m, n)    the greatest common divisor of m and n.
[a, b]    all the values of x (real or integer) such that a ≤ x ≤ b.
max_{l ≤ i ≤ u}(x_i)    the maximum value of x_i over all integer values of i in [l, u].
sign(x)    returns −1, 0, or 1, depending on whether x is negative, zero, or positive, respectively.
∧    logical and.
∨    logical or.
A ∩ B    set intersection.
A ∪ B    set union.

A number of loop-related terms dominate much of the thesis. Although these have been established in the literature, a reader unfamiliar with them may consult Section 1.4.2.2. Particular mention is made of the notion of a canonical loop nest, introduced in this thesis, which is explained by Definitions 4.1 and 4.2. Finally, throughout the thesis, the end of the proofs of theorems and lemmata is marked by QED, an abbreviation of the Latin phrase Quod Erat Demonstrandum (i.e. which was to be proved). The symbol □ is used to mark the end of examples.
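To make the load-balance question raised in the abstract concrete, the following minimal sketch counts the work of a triangular loop nest (for i = 1..n, for j = 1..i) and compares three ways of assigning the outer iterations to two processors. All function names are hypothetical and chosen only for this illustration. The third assignment uses Prouhet's classical construction (split indices by the parity of their binary digit sum); it is shown only as an instance of the Prouhet-Tarry-Escott idea and is not claimed to be the mapping strategy developed in the thesis.

```python
from math import ceil


def outer_work(n):
    """Work of each outer iteration i in: for i in 1..n: for j in 1..i: <body>.

    Counted by brute force; the total equals n * (n + 1) // 2.
    """
    return [sum(1 for _ in range(1, i + 1)) for i in range(1, n + 1)]


def loads(work, owner, p):
    """Sum the work assigned to each of p processors.

    `owner` maps a zero-based outer-iteration index to a processor number.
    """
    totals = [0] * p
    for index, w in enumerate(work):
        totals[owner(index)] += w
    return totals


def block_owner(n, p):
    """Contiguous blocks of ceil(n / p) outer iterations per processor."""
    size = ceil(n / p)
    return lambda index: index // size


def cyclic_owner(p):
    """Round-robin assignment of outer iterations."""
    return lambda index: index % p


def thue_morse_owner():
    """Two-processor split by parity of the binary digit sum (Prouhet)."""
    return lambda index: bin(index).count("1") % 2


if __name__ == "__main__":
    n, p = 8, 2
    work = outer_work(n)                               # [1, 2, ..., 8], total 36
    print("block     ", loads(work, block_owner(n, p), p))   # [10, 26]
    print("cyclic    ", loads(work, cyclic_owner(p), p))     # [16, 20]
    print("Thue-Morse", loads(work, thue_morse_owner(), p))  # [18, 18]
```

For n = 8 the Prouhet split assigns outer iterations {1, 4, 6, 7} to one processor and {2, 3, 5, 8} to the other. These two sets have equal sums (18) and equal sums of squares (102), so the two loads would also match if the work per outer iteration were any quadratic polynomial in i.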
