Tuning Blocked Array Layouts to Exploit Memory Hierarchy in SMT Architectures

Cache misses form a major bottleneck for memory-intensive applications, due to the significant latency of main memory accesses. Loop tiling, in conjunction with other program transformations, have been shown to be an effective approach to improving locality and cache exploitation, especially for dense matrix scientific computations. Beyond loop nest optimizations, data transformation techniques, and in particular blocked data layouts, have been used to boost the cache performance. The stability of performance improvements achieved are heavily dependent on the appropriate selection of tile sizes. In this paper, we investigate the memory performance of blocked data layouts, and provide a theoretical analysis for the multiple levels of memory hierarchy, when they are organized in a set associative fashion. According to this analysis, the optimal tile size that maximizes L1 cache utilization, should completely fit in the L1 cache, even for loop bodies that access more than just one array. Increased self- or/and cross-interference misses can be tolerated through prefetching. Such larger tiles also reduce mispredicted branches and, as a result, the lost CPU cycles that arise. Results are validated through actual benchmarks on an SMT platform.

[1]  Chau-Wen Tseng,et al.  Improving data locality with loop transformations , 1996, TOPL.

[2]  Sharad Malik,et al.  Cache miss equations: a compiler framework for analyzing and tuning memory behavior , 1999, TOPL.

[3]  Karim Esseghir Improving data locality for caches , 1993 .

[4]  David A. Patterson,et al.  Computer Architecture: A Quantitative Approach , 1969 .

[5]  Mahmut T. Kandemir,et al.  Improving Cache Locality by a Combination of Loop and Data Transformation , 1999, IEEE Trans. Computers.

[6]  Chau-Wen Tseng,et al.  Locality Optimizations for Multi-Level Caches , 1999, ACM/IEEE SC 1999 Conference (SC'99).

[7]  Olivier Temam,et al.  To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93. Proceedings.

[8]  Monica S. Lam,et al.  The cache performance and optimizations of blocked algorithms , 1991, ASPLOS IV.

[9]  Michael E. Wolf,et al.  Combining Loop Transformations Considering Caches and Scheduling , 2004, International Journal of Parallel Programming.

[10]  Xavier Vera,et al.  Cache and Compiler Interaction : How to analyze, optimize and time cache behavior , 2003 .

[11]  W. Jalby,et al.  To copy or not to copy: a compile-time technique for assessing when data copying should be used to eliminate cache conflicts , 1993, Supercomputing '93.

[12]  Chau-Wen Tseng,et al.  A Comparison of Compiler Tiling Algorithms , 1999, CC.

[13]  Ulrich Kremer,et al.  A Quantitative Analysis of Tile Size Selection Algorithms , 2004, The Journal of Supercomputing.

[14]  Olivier Temam,et al.  Cache interference phenomena , 1994, SIGMETRICS.

[15]  Zhiyuan Li,et al.  IMPACT OF TILE-SIZE SELECTION FOR SKEWED TILING , 2001 .

[16]  Jacqueline Chame,et al.  A tile selection algorithm for data locality and cache interference , 1999, ICS '99.

[17]  Hiroshi Nakamura,et al.  Augmenting Loop Tiling with Data Alignment for Improved Cache Performance , 1999, IEEE Trans. Computers.

[18]  Kathryn S. McKinley,et al.  Tile size selection using cache organization and data layout , 1995, PLDI '95.

[19]  Monica S. Lam,et al.  A data locality optimizing algorithm , 1991, PLDI '91.

[20]  Nectarios Koziris,et al.  Fast indexing for blocked array layouts to improve multi-level cache locality , 2004, Eighth Workshop on Interaction between Compilers and Computer Architectures, 2004. INTERACT-8 2004..

[21]  Graham R. Nudd,et al.  Analytical Modeling of Set-Associative Cache Behavior , 1999, IEEE Trans. Computers.

[22]  Viktor K. Prasanna,et al.  Analysis of memory hierarchy performance of block data layout , 2002, Proceedings International Conference on Parallel Processing.

[23]  Nectarios Koziris,et al.  A tile size selection analysis for blocked array layouts , 2005, 9th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT'05).

[24]  Chau-Wen Tseng,et al.  Eliminating conflict misses for high performance architectures , 1998, ICS '98.

[25]  Larry Carter,et al.  Quantifying the Multi-Level Nature of Tiling Interactions , 1997, International Journal of Parallel Programming.