On Increasing Architecture Awareness in Program Optimizations to Bridge the Gap between Peak and Sustained Processor Performance — Matrix-Multiply Revisited

As processor architectures grow more complex, the gap between peak and sustained processor performance widens, so that programs now tend to exploit only a fraction of the available performance. Despite the extensive literature on program optimizations, compiler optimizations remain inefficient because they suffer from three flaws: (1) they often implicitly rely on simplified, if not simplistic, models of processor architecture; (2) they usually focus on a single processor component (e.g., the cache) and ignore the interactions among multiple components; and (3) the most heavily investigated components (e.g., caches) sometimes have only a small impact on overall performance. Through an in-depth analysis of a simple program kernel, we aim to show that understanding the complex interactions between programs and the numerous processor architecture components is both feasible and critical to designing efficient program optimizations.
