Improving locality and parallelism in nested loops
[1] A. Aiken,et al. Loop Quantization: an Analysis and Algorithm , 1987 .
[2] Anoop Gupta,et al. Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.
[3] Michael L. Dowling. Optimal code parallelization using unimodular transformations , 1990, Parallel Comput..
[4] Steven W. K. Tjiang,et al. Automatic generation of data-flow analyzers : a tool for building optimizers , 1993 .
[5] J. Ramanujam,et al. Tiling multidimensional iteration spaces for nonshared memory machines , 1991, Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Supercomputing '91).
[6] W. Abu-Sufah,et al. Automatic program transformations for virtual memory computers , 1979, 1979 International Workshop on Managing Requirements Knowledge (MARK).
[7] Utpal Banerjee,et al. Dependence analysis for supercomputing , 1988, The Kluwer international series in engineering and computer science.
[8] M. Schlansker,et al. The Cydra 5 computer system architecture , 1988, Proceedings 1988 IEEE International Conference on Computer Design: VLSI.
[9] Ken Kennedy,et al. PFC: A Program to Convert Fortran to Parallel Form , 1982 .
[10] Ronald Gary Cytron. Compile-time scheduling and optimization for asynchronous machines , 1984 .
[11] Michael Wolfe,et al. More iteration space tiling , 1989, Proceedings of the 1989 ACM/IEEE Conference on Supercomputing (Supercomputing '89).
[12] Dan I. Moldovan,et al. Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays , 1986, IEEE Transactions on Computers.
[13] Ken Kennedy,et al. Automatic translation of FORTRAN programs to vector form , 1987, TOPL.
[14] Monica S. Lam,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[15] J. M. Delosme,et al. Efficient Systolic Arrays for the Solution of Toeplitz Systems: An Illustration of a Methodology for the Construction of Systolic Architectures in VLSI , 1985 .
[16] Jack J. Dongarra,et al. A set of level 3 basic linear algebra subprograms , 1990, TOMS.
[17] Robert P. Colwell,et al. A VLIW architecture for a trace scheduling compiler , 1987, ASPLOS.
[18] Ken Kennedy,et al. Estimating Interlock and Improving Balance for Pipelined Architectures , 1988, J. Parallel Distributed Comput..
[19] Ken Kennedy,et al. Improving register allocation for subscripted variables , 1990, PLDI '90.
[20] Geoffrey C. Fox,et al. The Perfect Club Benchmarks: Effective Performance Evaluation of Supercomputers , 1989, Int. J. High Perform. Comput. Appl..
[21] David A. Patterson,et al. Computer Architecture: A Quantitative Approach , 1990 .
[22] Ken Kennedy,et al. Software methods for improvement of cache performance on supercomputer applications , 1989 .
[23] James L. Elshoff. Some programming techniques for processing multi-dimensional matrices in a paging environment , 1974, AFIPS '74.
[24] Dan I. Moldovan,et al. Parallelism detection and transformation techniques useful for VLSI algorithms , 1985, J. Parallel Distributed Comput..
[25] Donald A. Calahan,et al. Block-Oriented, Local-Memory Based Linear Equation Solution on the Cray-2 Uniprocessor Algorithms , 1986, ICPP.
[26] Norman P. Jouppi,et al. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers , 1990, Proceedings of the 17th Annual International Symposium on Computer Architecture.
[27] A. C. McKellar,et al. The organization of matrices and matrix operations in a paged multiprogramming environment , 1968 .
[28] François Irigoin,et al. Supernode partitioning , 1988, POPL '88.
[29] Jack Dongarra,et al. Automatic Blocking of Nested Loops , 1990 .
[30] Michael Wolfe,et al. Beyond induction variables , 1992, PLDI '92.
[31] Dennis Gannon,et al. On the problem of optimizing data transfers for complex memory systems , 1988, ICS '88.
[32] Patrice Quinton. Automatic synthesis of systolic arrays from uniform recurrent equations , 1984, ISCA '84.
[33] Chau-Wen Tseng,et al. The Power Test for Data Dependence , 1992, IEEE Trans. Parallel Distributed Syst..
[34] Duncan H. Lawrie,et al. On the Performance Enhancement of Paging Systems Through Program Analysis and Transformations , 1981, IEEE Transactions on Computers.
[35] Walid Abu-Sufah,et al. Improving the performance of virtual memory computers , 1979 .
[36] Martine Ancourt. Génération automatique de codes de transfert pour multiprocesseurs à mémoires locales , 1991 .
[37] Monica S. Lam,et al. Efficient and exact data dependence analysis , 1991, PLDI '91.
[38] Dan I. Moldovan,et al. ADVIS: A Software Package for the Design of Systolic Arrays , 1987, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.
[39] Leslie Lamport,et al. The parallel execution of DO loops , 1974, CACM.
[40] William Jalby,et al. Impact of Hierarchical Memory Systems On Linear Algebra Algorithm Design , 1988 .
[41] Steven W. K. Tjiang,et al. Integrating Scalar Optimization and Parallelization , 1991, LCPC.
[42] O. A. Olukotun,et al. Implementing a cache for a high-performance GaAs microprocessor , 1991, Proceedings of the 18th Annual International Symposium on Computer Architecture.
[43] Corinne Ancourt,et al. Scanning polyhedra with DO loops , 1991, PPOPP '91.