Trading cache hit rate for memory performance

Most prior compiler-based data locality optimizations target cache locality exclusively; row-buffer locality in DRAM banks has received much less attention. In particular, to the best of our knowledge, no compiler-based approach improves row-buffer locality for irregular applications. This is a critical gap, since executing irregular applications in a power- and performance-efficient manner will be a key requirement for extracting maximum benefit from emerging multicore machines and exascale systems. Motivated by these observations, this paper makes the following contributions. First, it presents a compiler-runtime cooperative data layout optimization that takes as input an irregular program already optimized for cache locality and generates output code with the same cache performance but better row-buffer locality (fewer row-buffer misses). Second, it discusses a more aggressive strategy that sacrifices some cache performance to further improve row-buffer performance, i.e., it trades cache performance for memory system performance; the goal is to find the tradeoff point between cache performance and row-buffer performance that maximizes overall application performance. Third, the paper evaluates both approaches in detail on an AMD Opteron based multicore system and on a multicore simulator. Experimental results collected from five real-world irregular applications show that (i) conventional cache optimizations do not significantly improve row-buffer locality; (ii) the first approach improves execution time by about 9.8% by keeping the number of cache misses the same as the cache-optimized code while reducing the number of row-buffer misses; and (iii) the second approach achieves even higher execution time improvements (13.8% on average) by sacrificing some cache performance for additional memory performance.
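
The central intuition, that grouping accesses which fall into the same DRAM row reduces row-buffer misses, can be illustrated with a toy open-row model. The sketch below is not the paper's compiler-runtime algorithm; the bank count, row size, element size, and address-to-bank mapping are all illustrative assumptions. It counts row-buffer misses for a random (irregular) index stream, then again after the indices are reordered by (bank, row), to show why layout and access-order decisions matter to the memory system.

```c
/*
 * Minimal sketch (assumptions, not the paper's approach): a toy open-row
 * DRAM model that counts row-buffer misses for an irregular index stream,
 * before and after reordering the stream by (bank, row).
 */
#include <stdio.h>
#include <stdlib.h>

#define NUM_BANKS  8        /* assumed number of DRAM banks            */
#define ROW_BYTES  8192     /* assumed row (DRAM page) size in bytes   */
#define ELEM_BYTES 8        /* assumed size of one data element        */
#define N          100000   /* length of the irregular access stream   */

/* Simplistic row-interleaved address mapping (an assumption). */
static unsigned bank_of(unsigned long addr) { return (addr / ROW_BYTES) % NUM_BANKS; }
static unsigned long row_of(unsigned long addr) { return addr / (ROW_BYTES * NUM_BANKS); }

/* A row-buffer miss occurs when an access touches a row different from
 * the row currently open in its bank (open-row policy). */
static long count_row_misses(const unsigned *idx, int n)
{
    long open_row[NUM_BANKS];
    long misses = 0;
    for (int b = 0; b < NUM_BANKS; b++) open_row[b] = -1;
    for (int i = 0; i < n; i++) {
        unsigned long addr = (unsigned long)idx[i] * ELEM_BYTES;
        unsigned b = bank_of(addr);
        long r = (long)row_of(addr);
        if (open_row[b] != r) { misses++; open_row[b] = r; }
    }
    return misses;
}

/* Sort key: (row, bank), so accesses to the same row become adjacent. */
static int by_row(const void *pa, const void *pb)
{
    unsigned long a = (unsigned long)(*(const unsigned *)pa) * ELEM_BYTES;
    unsigned long b = (unsigned long)(*(const unsigned *)pb) * ELEM_BYTES;
    long ka = (long)(row_of(a) * NUM_BANKS + bank_of(a));
    long kb = (long)(row_of(b) * NUM_BANKS + bank_of(b));
    return (ka > kb) - (ka < kb);
}

int main(void)
{
    unsigned *idx = malloc(N * sizeof *idx);
    if (!idx) return 1;
    for (int i = 0; i < N; i++) idx[i] = rand() % (1 << 20);  /* irregular accesses */

    printf("row-buffer misses (original order): %ld\n", count_row_misses(idx, N));
    qsort(idx, N, sizeof *idx, by_row);
    printf("row-buffer misses (row-ordered)   : %ld\n", count_row_misses(idx, N));

    free(idx);
    return 0;
}
```

In this toy model, reordering collapses the miss count to roughly one miss per distinct (bank, row) pair touched. The paper's approaches instead act at the data-layout level through compiler-runtime cooperation; the model only captures the row-buffer-miss side of the cache/memory tradeoff discussed above.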
