A Cost Model for Integrated Restructuring Optimizations

Compilers must choose among competing optimizations. In this paper we present an analytic cost model for comparing several compile-time optimizations for memory-intensive, matrix-based codes. These optimizations increase the spatial locality of references to improve cache hierarchy performance. Specifically, we consider loop transformations, array restructuring, and address remapping, as well as combinations thereof. Our cost model compares the effectiveness of these optimizations and provides a good basis for deciding which one to use. To evaluate the cost model and the decisions based on it, we simulate eight applications on a variety of input sizes and with a variety of manually applied restructuring optimizations. We find that a single fixed strategy delivers suboptimal performance, and that the chosen optimization must be adjusted to each code. Our model generally predicts the best combination of restructuring optimizations among those we examined. The set of best optimizations under our model yields performance within a geometric mean of 5% of the best combination of candidate optimizations, regardless of the benchmark or its input dataset size.
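As a rough illustration of why loop transformations and array restructuring both improve spatial locality, the sketch below counts cache-line fetches for three ways of traversing a square matrix. It is not the cost model presented in the paper: the one-line "cache", the 32-byte line size, the 8-byte elements, and all function and variable names are assumptions chosen for illustration, and address remapping is not modeled.

```python
def line_fetches(addresses, line_bytes=32):
    """Toy model: a cache holding a single line. A fetch is counted
    whenever an access falls on a different line than the previous one."""
    fetches = 0
    last_line = None
    for addr in addresses:
        line = addr // line_bytes
        if line != last_line:
            fetches += 1
            last_line = line
    return fetches


def row_major_addr(i, j, n, elem_bytes=8):
    """Byte offset of element (i, j) in an n x n row-major array."""
    return (i * n + j) * elem_bytes


n = 64  # hypothetical matrix dimension

# Original nest: the inner loop walks down a column (stride of n elements).
original = line_fetches(
    row_major_addr(i, j, n) for j in range(n) for i in range(n)
)

# Loop interchange: swap the loops so the inner loop walks along a row.
interchanged = line_fetches(
    row_major_addr(i, j, n) for i in range(n) for j in range(n)
)

# Array restructuring: keep the loop order but store the array transposed,
# so logical element (i, j) lives at physical position (j, i).
restructured = line_fetches(
    row_major_addr(j, i, n) for j in range(n) for i in range(n)
)

print(original, interchanged, restructured)
```

Under these assumptions the interchanged and restructured variants fetch the same number of lines, which is why a realistic model also has to weigh what each option costs to apply (for example, copying the array, or the effect on other references in the nest) before choosing between them.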
