Optimizing for parallelism and data locality

Previous research has used program transformation to introduce parallelism and to exploit data locality. Unfortunately, these two objectives have usually been considered independently. This work explores the trade-offs between effectively utilizing parallelism and the memory hierarchy on shared-memory multiprocessors. We present a simple but surprisingly accurate memory model that determines cache line reuse arising both from multiple accesses to the same memory location and from accesses to consecutive memory locations. The model drives memory-optimizing and loop-parallelization algorithms that exploit data locality and parallelism in concert. We demonstrate the efficacy of this approach with very encouraging experimental results.
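To make the two kinds of reuse concrete, here is a minimal sketch of a cache-line cost estimate for a single loop. It is an illustration under stated assumptions, not the model defined in the paper: the function names, the stride-based classification, the 512-iteration loops, and the 4-element cache line are all hypothetical. A reference that does not move with the loop reuses one line (same memory location), a reference with a small stride shares each line among several iterations (consecutive locations), and any other reference is charged one line per iteration.

```python
def lines_touched(trip_count, stride_elems, line_elems):
    """Estimate cache lines touched by one array reference over one loop.

    stride_elems is the distance, in array elements, between the addresses
    touched on consecutive iterations; 0 means the reference is invariant.
    All parameters are illustrative assumptions, not values from the paper.
    """
    stride = abs(stride_elems)
    if stride == 0:
        return 1  # reuse of the same memory location
    if stride < line_elems:
        # reuse from consecutive accesses: several iterations share a line
        return (trip_count * stride + line_elems - 1) // line_elems
    return trip_count  # no reuse: a new line on every iteration


def loop_cost(strides, trip_count, line_elems):
    """Sum the estimate over all references in the loop body."""
    return sum(lines_touched(trip_count, s, line_elems) for s in strides)


# Hypothetical example: C(i,j) = C(i,j) + A(i,k) * B(k,j) with column-major
# n x n arrays, so the first subscript is the contiguous one.
n, line = 512, 4
candidates = {
    "i": [1, 1, 0],  # strides of C(i,j), A(i,k), B(k,j) when i is innermost
    "k": [0, n, 1],  # strides when k is innermost
    "j": [n, 0, n],  # strides when j is innermost
}
cost = {loop: loop_cost(s, n, line) for loop, s in candidates.items()}
best = min(cost, key=cost.get)
print(cost, best)
```

Under these assumptions the estimate favors placing `i` innermost, since two of the three references then walk consecutive memory and the third is invariant. A parallelizing compiler would weigh such an ordering against which loops can run in parallel, which is the trade-off the abstract describes.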
