Paper Information - PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization

DRAM row buffer conflicts can significantly increase memory access latency. This paper presents a new page-allocation-based optimization that works together with some existing hardware and software optimizations to eliminate significantly more row buffer conflicts. In simulation, using a set of selected scientific and engineering benchmarks and comparing against a few representative memory controller optimizations, our method reduces row buffer miss rates by up to 76% (37.4% on average). This reduction in row buffer miss rates translates into performance speedups of up to 15% (5% on average).
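
To make the row buffer miss rate metric concrete, the following is a minimal toy model, not the paper's algorithm or simulator. It assumes an open-page policy, a simple row-interleaved address mapping, and hypothetical values for `ROW_SIZE` and `NUM_BANKS`; the function names are also illustrative. It shows how the placement of two interleaved access streams can decide whether they conflict in one bank's row buffer.

```python
ROW_SIZE = 8192   # assumed bytes per DRAM row (hypothetical value)
NUM_BANKS = 4     # assumed number of banks (hypothetical value)

def bank_and_row(addr):
    """Map a physical address to (bank, row) using simple row interleaving."""
    row_index = addr // ROW_SIZE
    return row_index % NUM_BANKS, row_index // NUM_BANKS

def row_buffer_miss_rate(addresses):
    """Fraction of accesses whose row is not already open in its bank."""
    open_row = {}          # bank -> currently open row (open-page policy)
    misses = 0
    for addr in addresses:
        bank, row = bank_and_row(addr)
        if open_row.get(bank) != row:
            misses += 1            # cold miss or row buffer conflict
            open_row[bank] = row   # the newly activated row stays open
    return misses / len(addresses)

def interleave(base_a, base_b, n=64, stride=64):
    """Alternate accesses between two streams starting at base_a and base_b."""
    trace = []
    for i in range(n):
        trace.append(base_a + i * stride)
        trace.append(base_b + i * stride)
    return trace

# Both streams land in bank 0 but in different rows: every access conflicts.
same_bank = row_buffer_miss_rate(interleave(0, ROW_SIZE * NUM_BANKS))

# The second stream is placed in bank 1: only the two cold misses remain.
other_bank = row_buffer_miss_rate(interleave(0, ROW_SIZE))

print(same_bank, other_bank)
```

In this toy setting, moving the second stream to a different bank removes all conflicts after the first access to each row. A page allocator that controls which physical pages back each data structure has this kind of influence over bank and row placement.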

Li Chen | Xiaobing Feng | Wei Mi | Jingling Xue | Yao-Cang Jia
