Paper information - On Increasing Architecture Awareness in Program Optimizations to Bridge the Gap between Peak and Sustained Processor Performance

On Increasing Architecture Awareness in Program Optimizations to Bridge the Gap between Peak and Sustained Processor Performance — Matrix-Multiply Revisited

As the complexity of processor architectures increases, there is a widening gap between peak processor performance and sustained processor performance so that programs now tend to exploit only a fraction of available performance. While there is a tremendous amount of literature on program optimizations, compiler optimizations lack efficiency because they are plagued by three flaws: (1) they often implicitly use simplified, if not simplistic, models of processor architecture, (2) they usually focus on a single processor component (e.g., cache) and ignore the interactions among multiple components, (3) the most heavily investigated components (e.g., caches) sometimes have only a small impact on overall performance. Through the in-depth analysis of a simple program kernel, we want to show that understanding the complex interactions between programs and the numerous processor architecture components is both feasible and critical to design efficient program optimizations.

David Parello | Olivier Temam | Jean-Marie Verdun

[1] Larry Carter, et al. Hierarchical tiling for improved superscalar performance, 1995, Proceedings of 9th International Parallel Processing Symposium.

[2] Siddhartha Chatterjee, et al. Exact analysis of the cache behavior of nested loops, 2001, PLDI '01.

[3] Michael F. P. O'Boyle, et al. The effect of cache models on iterative compilation for combined tiling and unrolling, 2004, Concurr. Comput. Pract. Exp.

[4] James E. Smith, et al. The microarchitecture of superscalar processors, 1995, Proc. IEEE.

[5] Michael E. Wolf, et al. The cache performance and optimizations of blocked algorithms, 1991, ASPLOS IV.

[6] Monica S. Lam, et al. Maximizing Multiprocessor Performance with the SUIF Compiler, 1996, Digit. Tech. J.

[7] W. Jalby, et al. To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts, 1993, Supercomputing '93.

[8] Chau-Wen Tseng, et al. A Comparison of Compiler Tiling Algorithms, 1999, CC.

[9] Alan Jay Smith, et al. Cache Memories, 1982, CSUR.

[10] Tomás Lang, et al. MOB forms: a class of multilevel block algorithms for dense linear algebra operations, 1994, ICS '94.

[11] Monica S. Lam, et al. A data locality optimizing algorithm (with retrospective), 1991.

[12] Sharad Malik, et al. Precise miss analysis for program transformations with caches of arbitrary associativity, 1998, ASPLOS VIII.

[13] Anoop Gupta, et al. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors, 1991, J. Parallel Distributed Comput.

[14] Anoop Gupta, et al. Design and evaluation of a compiler algorithm for prefetching, 1992, ASPLOS V.

[15] James Demmel, et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology, 1997, ICS '97.

[16] Michael F. P. O'Boyle, et al. Fast and Accurate Evaluation of Memory Performance Upper-Bound, 2001.

[17] Alvin R. Lebeck, et al. Load latency tolerance in dynamically scheduled processors, 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.

[18] Larry Carter, et al. Memory hierarchy considerations for fast transpose and bit-reversals, 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.

[19] Michael F. P. O'Boyle, et al. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation, 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).

[20] Kathryn S. McKinley, et al. Tile size selection using cache organization and data layout, 1995, PLDI '95.

[21] Bowen Alpern, et al. Hierarchical Tiling: A Methodology for High Performance, 1996.

[22] Monica S. Lam, et al. A data locality optimizing algorithm, 1991, PLDI '91.

[23] Larry Carter, et al. Quantifying the Multi-Level Nature of Tiling Interactions, 1997, International Journal of Parallel Programming.

[24] T. Kisuki, et al. Iterative Compilation in Program Optimization, 2000.

[25] Steve Carr, et al. Combining optimization for cache and instruction-level parallelism, 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.

[26] Bruce Leasure, et al. The KAP Parallelizer for DEC Fortran and DEC C Programs, 1994, Digit. Tech. J.

[27] Olivier Temam, et al. A quantitative analysis of loop nest locality, 1996, ASPLOS VII.

[28] Maged M. Michael, et al. Accuracy and speed-up of parallel trace-driven architectural simulation, 1997, Proceedings 11th International Parallel Processing Symposium.

[29] David F. Bacon, et al. Compiler transformations for high-performance computing, 1994, CSUR.

[30] Steve Carr, et al. Unroll-and-jam using uniformly generated sets, 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.