On the Quest for Perfect Load Balance in Loop-Based Parallel Computations
[1] Werner Struckmann,et al. Parc++: A parallel C++ , 1995, Softw. Pract. Exp..
[2] Rudolf Eigenmann,et al. Performance Analysis of Parallelizing Compilers on the Perfect Benchmarks Programs , 1992, IEEE Trans. Parallel Distributed Syst..
[3] David R. Wallace,et al. Dependence of multi-dimensional array references , 1988, ICS '88.
[4] Chau-Wen Tseng,et al. Compiler optimizations for improving data locality , 1994, ASPLOS VI.
[5] Michael F. P. O'Boyle,et al. Expert Programmer versus Parallelizing Compiler: A Comparative Study of Two Approaches for Distributed Shared Memory , 1996, Sci. Program..
[6] Geoffrey C. Fox,et al. Parallel Computing Works , 1994 .
[7] Manish Gupta. Automatic data partitioning on distributed memory multicomputers. Ph.D. Thesis , 1992 .
[8] Andrew A. Chien,et al. Analysis of Dynamic Structures for Efficient Parallel Execution , 1993, LCPC.
[9] Tim J. Harris,et al. A survey of PRAM simulation techniques , 1994, CSUR.
[10] Rudolf Eigenmann,et al. Symbolic analysis techniques for effective automatic parallelization , 1995 .
[11] Jake K. Aggarwal,et al. A Generalized Scheme for Mapping Parallel Algorithms , 1993, IEEE Trans. Parallel Distributed Syst..
[12] Chau-Wen Tseng,et al. Improving data locality with loop transformations , 1996, TOPL.
[13] William Pugh,et al. A practical algorithm for exact array dependence analysis , 1992, CACM.
[14] Steven Mark Carr,et al. Memory-hierarchy management , 1993 .
[15] Barbara M. Chapman,et al. Programming in Vienna Fortran , 1992, Sci. Program..
[16] Rudolf Eigenmann,et al. An Overview of Symbolic Analysis Techniques Needed for the Effective Parallelization of the Perfect Benchmarks , 1994, 1994 International Conference on Parallel Processing Vol. 2.
[17] Skef Wholey. Automatic data mapping for distributed-memory parallel computers , 1992, ICS '92.
[18] Weijia Shang,et al. On Loop Transformations for Generalized Cycle Shrinking , 1994, IEEE Trans. Parallel Distributed Syst..
[19] Michel Cosnard,et al. Automatic task graph generation techniques , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.
[20] David B. Loveman. Program improvement by source to source transformation , 1976, POPL '76.
[21] Edith Schonberg,et al. Factoring: a method for scheduling parallel loops , 1992 .
[22] Inmos Limited,et al. OCCAM 2 reference manual , 1988 .
[23] Ken Kennedy,et al. Automatic Data Layout Using 0-1 Integer Programming , 1994, IFIP PACT.
[24] Alan H. Karp,et al. Measuring parallel processor performance , 1990, CACM.
[25] Rudolf Eigenmann,et al. The range test: a dependence test for symbolic, non-linear expressions , 1994, Proceedings of Supercomputing '94.
[26] E. M. Wright,et al. Prouhet's 1851 Solution of the Tarry-Escott Problem of 1910 , 1959 .
[27] Hans P. Zima,et al. Compiling for distributed-memory systems , 1993 .
[28] Dan I. Moldovan,et al. Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays , 1986, IEEE Transactions on Computers.
[29] Cherri M. Pancake,et al. Software Support for Parallel Computing: Where Are We headed? , 1991 .
[30] Martin E. Dyer,et al. A Random Polynomial Time Algorithm for Approximating the Volume of Convex Bodies , 1989, STOC.
[31] R. Sakellariou,et al. A Computational Study of Parallel Algorithms for the All-Pairs Shortest Path Problem , 1994 .
[32] G. A. Hedayat,et al. Interactive visualization of high-dimension iteration and data sets , 1995, Programming Models for Massively Parallel Computers.
[33] Philippe Clauss,et al. Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: applications to analyze and transform scientific programs , 1996 .
[34] Vance Faber,et al. Comments on the paper "parallel efficiency can be greater than unity" , 1987, Parallel Comput..
[35] Paul Feautrier,et al. Processor allocation and loop scheduling on multiprocessor computers , 1992, ICS '92.
[36] Utpal Banerjee,et al. Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.
[37] David K. Smith. Theory of Linear and Integer Programming , 1987 .
[38] Jack J. Dongarra,et al. A comparative study of automatic vectorizing compilers , 1991, Parallel Comput..
[39] William Pugh,et al. Determining schedules based on performance estimation , 1993 .
[40] Xiaodong Zhang,et al. Spin-lock synchronization on the Butterfly and KSR1 , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[41] Michael A. Driscoll,et al. Accurate Predictions of Parallel Program Execution Time , 1995, J. Parallel Distributed Comput..
[42] G. Amdahl,et al. Validity of the single processor approach to achieving large scale computing capabilities , 1967, AFIPS '67 (Spring).
[43] Michael Gerndt,et al. SUPERB: A tool for semi-automatic MIMD/SIMD parallelization , 1988, Parallel Comput..
[44] Ten Hwan Tzen. Advanced loop parallelization: dependence uniformization and trapezoid self-scheduling , 1992 .
[45] Thomas Fahringer. Estimating and Optimizing Performance for Parallel Programs , 1995, Computer.
[46] G. G. Stokes. "J." , 1890, The New Yale Book of Quotations.
[47] Constantine D. Polychronopoulos,et al. Symbolic analysis for parallelizing compilers , 1996, TOPL.
[48] Donald E. Knuth,et al. Big Omicron and big Omega and big Theta , 1976, SIGA.
[49] William Pugh,et al. Finding Legal Reordering Transformations Using Mappings , 1994, LCPC.
[50] Christine Eisenbeis,et al. A general algorithm for data dependence analysis , 1992, ICS '92.
[51] Evangelos P. Markatos. Scheduling for locality in shared-memory multiprocessors , 1993 .
[52] Ken Kennedy,et al. Practical dependence testing , 1991, PLDI '91.
[53] S. Graham,et al. Compiler Transformations for High-Performance , 1993 .
[54] Geoffrey C. Fox,et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers , 1989, Int. J. High Perform. Comput. Appl..
[55] David C. Cann,et al. Retire Fortran?: a debate rekindled , 1992, CACM.
[56] David A. Padua,et al. Dependence graphs and compiler optimizations , 1981, POPL '81.
[57] D. Fischer,et al. On superlinear speedups , 1991, Parallel Comput..
[58] Michael Wolfe,et al. Beyond induction variables , 1992, PLDI '92.
[59] Nicolas Paris. Pompc: A C Language For Data Parallelism , 1993 .
[60] Lawrence A. Crowl. How to measure, present, and compare parallel performance , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[61] J. Ramanujam,et al. Non-unimodular transformations of nested loops , 1992, Proceedings Supercomputing '92.
[62] Jingling Xue. Automating Non-Unimodular Loop Transformations for Massive Parallelism , 1994, Parallel Comput..
[63] Corinne Ancourt,et al. Scanning polyhedra with DO loops , 1991, PPOPP '91.
[64] David J. Evans,et al. Inter-Procedural Analysis for Parallel Computing , 1995, Parallel Comput..
[65] Geoffrey C. Fox,et al. On the parallelization of blocked LU factorization algorithms on distributed memory architectures , 1992, Proceedings Supercomputing '92.
[66] Michael J. Flynn,et al. Some Computer Organizations and Their Effectiveness , 1972, IEEE Transactions on Computers.
[67] Ken Kennedy,et al. Maximizing Loop Parallelism and Improving Data Locality via Loop Fusion and Distribution , 1993, LCPC.
[68] Michael F. P. O'Boyle. A Data Partitioning Algorithm for Distributed Memory Compilation , 1994, PARLE.
[69] J. L. Schonfelder,et al. Programming in FORTRAN 90 , 1993 .
[70] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[71] Richard M. Karp,et al. Parallel Algorithms for Shared-Memory Machines , 1991, Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity.
[72] David Padua,et al. Machine-Independent Evaluation of Parallelizing Compilers , 1992 .
[73] K. A. Gallivan,et al. Parallel Algorithms for Dense Linear Algebra Computations , 1990, SIAM Rev..
[74] Kenneth E. Iverson,et al. A programming language , 1962, AIEE-IRE '62 (Spring).
[75] Ken Kennedy,et al. Loop distribution with arbitrary control flow , 1990, Proceedings SUPERCOMPUTING '90.
[76] David L. Presberg,et al. The Paralyzer: Ivtran's Parallelism Analyzer and Synthesizer , 1975, Programming Languages and Compilers for Parallel and Vector Machines.
[77] Michael F. P. O'Boyle,et al. Load Balancing of Parallel Affine Loops by Unimodular Transformations , 1992 .
[78] Lawrence Rauchwerger,et al. Automatic Detection of Parallelism: A grand challenge for high performance computing , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[79] David J. Lilja. Exploiting the parallelism available in loops , 1994, Computer.
[80] Nadia Tawbi. Estimation of nested loops execution time by integer arithmetic in convex polyhedra , 1994, Proceedings of 8th International Parallel Processing Symposium.
[81] Jacques Cohen,et al. Automating program analysis , 1988, JACM.
[82] P. Sadayappan,et al. Communication-Free Hyperplane Partitioning of Nested Loops , 1991, LCPC.
[83] Paul Havlak,et al. Interprocedural symbolic analysis , 1995 .
[84] David A. Padua,et al. Advanced compiler optimizations for supercomputers , 1986, CACM.
[85] Joe D. Warren,et al. The program dependence graph and its use in optimization , 1987, TOPL.
[86] Ben Wegbreit,et al. Mechanical program analysis , 1975, CACM.
[87] Chau-Wen Tseng,et al. The Power Test for Data Dependence , 1992, IEEE Trans. Parallel Distributed Syst..
[88] Leslie G. Valiant,et al. A bridging model for parallel computation , 1990, CACM.
[89] Ken Kennedy,et al. Compiling Fortran D for MIMD distributed-memory machines , 1992, CACM.
[90] Peiyi Tang,et al. Dynamic Processor Self-Scheduling for General Parallel Nested Loops , 1987, IEEE Trans. Computers.
[91] Michael F. P. O'Boyle,et al. A Compiler Strategy for Shared Virtual Memories , 1996 .
[92] Christian Lengauer,et al. Unimodularity Considered Non-Essential , 1992, CONPAR.
[93] Michael F. P. O'Boyle,et al. Practical loop generation , 1996, Proceedings of HICSS-29: 29th Hawaii International Conference on System Sciences.
[94] Constantine D. Polychronopoulos,et al. Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers , 1987, IEEE Transactions on Computers.
[95] Michael E. Wolf,et al. Improving locality and parallelism in nested loops , 1992 .
[96] Constantine D. Polychronopoulos,et al. Parallel programming and compilers , 1988 .
[97] Monica S. Lam,et al. Global optimizations for parallelism and locality on scalable parallel machines , 1993, PLDI '93.
[98] P. Feautrier. Array expansion , 1988 .
[99] Jingke Li,et al. Index domain alignment: minimizing cost of cross-referencing between distributed arrays , 1990, [1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation.
[100] G. M.,et al. The Thirteen Books of Euclid's Elements , 1909, Nature.
[101] Donald E. Knuth,et al. An empirical study of FORTRAN programs , 1971, Softw. Pract. Exp..
[102] Vivek Sarkar,et al. Experiences using control dependence in PTRAN , 1990 .
[103] Ken Kennedy,et al. Automatic translation of FORTRAN programs to vector form , 1987, TOPL.
[104] Alexander I. Barvinok. Computing the volume, counting integral points, and exponential sums , 1993, Discret. Comput. Geom..
[105] Manish Gupta,et al. Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers , 1992, IEEE Trans. Parallel Distributed Syst..
[106] Yves Robert,et al. Revisiting cycle shrinking , 1992, Parallel Comput..
[107] William Pugh,et al. Counting solutions to Presburger formulas: how and why , 1994, PLDI '94.
[108] W. Pugh,et al. A framework for unifying reordering transformations , 1993 .
[109] Zhiyuan Li,et al. An Efficient Data Dependence Analysis for Parallelizing Compilers , 1990, IEEE Trans. Parallel Distributed Syst..
[110] Nadia Tawbi. Parallelisation automatique : estimation des durees d'execution et allocation statique de processeurs , 1991 .
[111] David J. Kuck,et al. A Survey of Parallel Machine Organization and Programming , 1977, CSUR.
[112] Vadim Maslov,et al. Delinearization: an efficient way to break multiloop dependence equations , 1992, PLDI '92.
[113] T. A. A. Broadbent,et al. Diophantus of Alexandria , 1966, The Mathematical Gazette.
[114] Using Processor Affinity in Loop Scheduling on Shared Memory Multiprocessors , 1994 .
[115] Ulrich Kremer,et al. NP-completeness of Dynamic Remapping , 1993 .
[116] Alan H. Karp,et al. A comparison of 12 parallel FORTRAN dialects , 1988, IEEE Software.
[117] Alexandru Nicolau,et al. A general data dependence test for dynamic, pointer-based data structures , 1994, PLDI '94.
[118] Chris R. Jesshope,et al. Parallel Computers 2: Architecture, Programming and Algorithms , 1981 .
[119] Karen Lee Pieper. Parallelizing compilers: implementation and effectiveness , 1993 .
[120] Tarek S. Abdelrahman,et al. Fusion of Loops for Parallelism and Locality , 1997, IEEE Trans. Parallel Distributed Syst..
[121] Kathryn S. McKinley,et al. Automatic and interactive parallelization , 1992 .
[122] François Irigoin,et al. Supernode partitioning , 1988, POPL '88.
[123] Gene H. Golub,et al. Matrix computations , 1983 .
[124] Vivek Sarkar,et al. Partitioning and Scheduling Parallel Programs for Multiprocessing , 1989 .
[125] Ken Kennedy,et al. Evaluating Compiler Optimizations for Fortran D , 1994, J. Parallel Distributed Comput..
[126] Zdenek Hanzalek. Parallel processing: From applications to systems , 1997 .
[127] Barbara M. Chapman,et al. Supercompilers for parallel and vector computers , 1990, ACM Press frontier series.
[128] W. J. Worlton. Toward a science of parallel computation , 1986 .
[129] Michael Wolfe,et al. High performance compilers for parallel computing , 1995 .
[130] W. Daniel Hillis,et al. The CM-5 Connection Machine: a scalable supercomputer , 1993, CACM.
[131] Lubomir F. Bic,et al. Automatic Parallelization Techniques for the EM-4 , 1993, 1993 International Conference on Parallel Processing - ICPP'93.
[132] Vivek Sarkar,et al. On Estimating and Enhancing Cache Effectiveness , 1991, LCPC.
[133] Vipin Kumar,et al. Isoefficiency: measuring the scalability of parallel algorithms and architectures , 1993, IEEE Parallel & Distributed Technology: Systems & Applications.
[134] David A. Padua,et al. Automatic Array Privatization , 1993, Compiler Optimizations for Scalable Parallel Systems Languages.
[135] Lawrence Rauchwerger,et al. Parallelizing while loops for multiprocessor systems , 1995, Proceedings of 9th International Parallel Processing Symposium.
[136] Zhiyuan Li. Array privatization for parallel execution of loops , 1992, ICS.
[137] Jack J. Dongarra,et al. Matrix Eigensystem Routines — EISPACK Guide Extension , 1977, Lecture Notes in Computer Science.
[138] Zhiwei Xu,et al. Modeling communication overhead: MPI and MPL performance on the IBM SP2 , 1996, IEEE Parallel Distributed Technol. Syst. Appl..
[139] Thomas J. LeBlanc,et al. Parallel performance prediction using lost cycles analysis , 1994, Proceedings of Supercomputing '94.
[140] Ewald Speckenmeyer,et al. Is Average Superlinear Speedup Possible? , 1988, CSL.
[141] Rudolf Eigenmann,et al. Automatic program parallelization , 1993, Proc. IEEE.
[142] Jacques Cohen,et al. Two Algorithms for Determining Volumes of Convex Polyhedra , 1979, JACM.
[143] Wei Li. Compiler Optimizations for Cache Locality and Coherence , 1994 .
[144] M. Fischer,et al. Super-Exponential Complexity of Presburger Arithmetic , 1974 .
[145] Alan Weiss,et al. Allocating Independent Subtasks on Parallel Processors , 1985, IEEE Transactions on Software Engineering.
[146] Xian-He Sun,et al. Toward a better parallel performance metric , 1991, Parallel Comput..
[147] Jagdish J. Modi,et al. Parallel algorithms and matrix computation , 1988 .
[148] Utpal Banerjee,et al. Loop Transformations for Restructuring Compilers: The Foundations , 1993, Springer US.
[149] L.M. Ni,et al. Trapezoid Self-Scheduling: A Practical Scheduling Scheme for Parallel Compilers , 1993, IEEE Trans. Parallel Distributed Syst..
[150] Kleanthis Psarris,et al. The I Test: An Improved Dependence Test for Automatic Parallelization and Vectorization , 1991, IEEE Trans. Parallel Distributed Syst..
[151] William Pugh,et al. Static analysis of upper and lower bounds on dependences and parallelism , 1994, TOPL.
[152] Donald E. Knuth. The art of computer programming: fundamental algorithms , 1969 .
[153] David J. Lilja,et al. Parameter estimation for a generalized parallel loop scheduling algorithm , 1995, Proceedings of the Twenty-Eighth Annual Hawaii International Conference on System Sciences.
[154] Monica S. Lam,et al. Interprocedural Analysis for Parallelization , 1995, LCPC.
[155] Michael D. Rice,et al. Modeling the Serial and Parallel Fractions of a Parallel Algorithm , 1991, J. Parallel Distributed Comput..
[156] Thomas R. Gross,et al. Task Parallelism in a High Performance Fortran Framework , 1994, IEEE Parallel & Distributed Technology: Systems & Applications.
[157] L. Mordell,et al. Diophantine equations , 1969 .
[158] Zhiyu Shen,et al. An Empirical Study of Fortran Programs for Parallelizing Compilers , 1990, IEEE Trans. Parallel Distributed Syst..
[159] Ko-Yang Wang. Precise compile-time performance prediction for superscalar-based computers , 1994, PLDI '94.
[160] Xian-He Sun,et al. Scalability of Parallel Algorithm-Machine Combinations , 1994, IEEE Trans. Parallel Distributed Syst..
[161] G. Hardy,et al. An Introduction to the Theory of Numbers , 1938 .
[162] Michael G. Norman,et al. Models of machines and computation for mapping in multicomputers , 1993, CSUR.
[163] Lawrence S. Kroll. Mathematica--A System for Doing Mathematics by Computer. , 1989 .
[164] Michael Wolfe,et al. Interprocedural alias analysis: Implementation and empirical results , 1993, Softw. Pract. Exp..
[165] J. Lawrence. Polytope volume computation , 1991 .
[166] Utpal Banerjee,et al. A theory of loop permutations , 1990 .
[167] Yong Yan,et al. Latency Metric: An Experimental Method for Measuring and Evaluating Parallel Program and Architecture Scalability , 1994, J. Parallel Distributed Comput..
[168] Helmar Burkhart,et al. Performance-Measurement Tools in a Multiprocessor Environment , 1989, IEEE Trans. Computers.
[169] M. R. Spiegel. Mathematical handbook of formulas and tables , 1968 .
[170] John L. Hennessy,et al. Efficient and exact data dependence analysis , 1991 .
[171] Constantine D. Polychronopoulos,et al. Symbolic Analysis: A Basis for Parallelization, Optimization, and Scheduling of Programs , 1993, LCPC.
[172] Michael O'Boyle,et al. Program and data transformations for efficient execution on distributed memory architectures , 1993, Technical report series.
[173] Philip J. Hatcher,et al. Data-Parallel Programming on MIMD Computers , 1991, IEEE Trans. Parallel Distributed Syst..
[174] Ron Cytron,et al. An Overview of the PTRAN Analysis System for Multiprocessing , 1988, J. Parallel Distributed Comput..
[175] Zvi M. Kedem,et al. Mapping Nested Loop Algorithms into Multidimensional Systolic Arrays , 1990, IEEE Trans. Parallel Distributed Syst..
[176] Alexander V. Veidenbaum,et al. The effect of restructuring compilers on program performance for high-speed computers , 1985 .
[177] Mark Crovella,et al. Performance Prediction and Tuning of Parallel Programs , 1994 .
[178] Martin E. Dyer,et al. On the Complexity of Computing the Volume of a Polyhedron , 1988, SIAM J. Comput..