On Increasing Architecture Awareness in Program Optimizations to Bridge the Gap between Peak and Sustained Processor Performance — Matrix-Multiply Revisited
[1] Larry Carter,et al. Hierarchical tiling for improved superscalar performance , 1995, Proceedings of 9th International Parallel Processing Symposium.
[2] Siddhartha Chatterjee,et al. Exact analysis of the cache behavior of nested loops , 2001, PLDI '01.
[3] Michael F. P. O'Boyle,et al. The effect of cache models on iterative compilation for combined tiling and unrolling , 2004, Concurr. Comput. Pract. Exp..
[4] James E. Smith,et al. The microarchitecture of superscalar processors , 1995, Proc. IEEE.
[5] Michael E. Wolf,et al. The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.
[6] Monica S. Lam,et al. Maximizing Multiprocessor Performance with the SUIF Compiler , 1996, Digit. Tech. J..
[7] W. Jalby,et al. To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.
[8] Chau-Wen Tseng,et al. A Comparison of Compiler Tiling Algorithms , 1999, CC.
[9] Alan Jay Smith,et al. Cache Memories , 1982, CSUR.
[10] Tomás Lang,et al. MOB forms: a class of multilevel block algorithms for dense linear algebra operations , 1994, ICS '94.
[11] Monica S. Lam,et al. A data locality optimizing algorithm (with retrospective) , 1991.
[12] Sharad Malik,et al. Precise miss analysis for program transformations with caches of arbitrary associativity , 1998, ASPLOS VIII.
[13] Anoop Gupta,et al. Tolerating Latency Through Software-Controlled Prefetching in Shared-Memory Multiprocessors , 1991, J. Parallel Distributed Comput..
[14] Anoop Gupta,et al. Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.
[15] James Demmel,et al. Optimizing matrix multiply using PHiPAC: a portable, high-performance, ANSI C coding methodology , 1997, ICS '97.
[16] Michael F. P. O'Boyle,et al. Fast and Accurate Evaluation of Memory Performance Upper-Bound , 2001.
[17] Alvin R. Lebeck,et al. Load latency tolerance in dynamically scheduled processors , 1998, Proceedings. 31st Annual ACM/IEEE International Symposium on Microarchitecture.
[18] Larry Carter,et al. Memory hierarchy considerations for fast transpose and bit-reversals , 1999, Proceedings Fifth International Symposium on High-Performance Computer Architecture.
[19] Michael F. P. O'Boyle,et al. Combined Selection of Tile Sizes and Unroll Factors Using Iterative Compilation , 2000, Proceedings 2000 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.PR00622).
[20] Kathryn S. McKinley,et al. Tile size selection using cache organization and data layout , 1995, PLDI '95.
[21] Bowen Alpern,et al. Hierarchical Tiling: A Methodology for High Performance , 1996.
[22] Monica S. Lam,et al. A data locality optimizing algorithm , 1991, PLDI '91.
[23] Larry Carter,et al. Quantifying the Multi-Level Nature of Tiling Interactions , 1997, International Journal of Parallel Programming.
[24] T. Kisuki,et al. Iterative Compilation in Program Optimization , 2000.
[25] Steve Carr,et al. Combining optimization for cache and instruction-level parallelism , 1996, Proceedings of the 1996 Conference on Parallel Architectures and Compilation Technique.
[26] Bruce Leasure,et al. The KAP Parallelizer for DEC Fortran and DEC C Programs , 1994, Digit. Tech. J..
[27] Olivier Temam,et al. A quantitative analysis of loop nest locality , 1996, ASPLOS VII.
[28] Maged M. Michael,et al. Accuracy and speed-up of parallel trace-driven architectural simulation , 1997, Proceedings 11th International Parallel Processing Symposium.
[29] David F. Bacon,et al. Compiler transformations for high-performance computing , 1994, CSUR.
[30] Steve Carr,et al. Unroll-and-jam using uniformly generated sets , 1997, Proceedings of 30th Annual International Symposium on Microarchitecture.