PARBLO: Page-Allocation-Based DRAM Row Buffer Locality Optimization

DRAM row buffer conflicts can increase memory access latency significantly. This paper presents a new page-allocation-based optimization that works seamlessly together with some existing hardware and software optimizations to eliminate significantly more row buffer conflicts. Validation in simulation using a set of selected scientific and engineering benchmarks against a few representative memory controller optimizations shows that our method can reduce row buffer miss rates by up to 76% (with an average of 37.4%). This reduction in row buffer miss rates will be translated into performance speedups by up to 15% (with an average of 5%).

[1]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[2]  Wei-Fen Lin,et al.  Reducing DRAM latencies with an integrated memory hierarchy design , 2001, Proceedings HPCA Seventh International Symposium on High-Performance Computer Architecture.

[3]  Onur Mutlu,et al.  Prefetch-Aware DRAM Controllers , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[4]  James Westall,et al.  An empirical study of the effects of careful page placement in Linux , 1998, ACM-SE 36.

[5]  Nikil D. Dutt,et al.  Memory aware compilation through accurate timing extraction , 2000, Proceedings 37th Design Automation Conference.

[6]  Mahmut Kandemir,et al.  Memory bank aware dynamic loop scheduling , 2007 .

[7]  Mahmut T. Kandemir,et al.  Exploiting bank locality in multi-bank memories , 2003, CASES '03.

[8]  Won-Taek Lim,et al.  Effective Management of DRAM Bandwidth in Multicore Processors , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[9]  Zhao Zhang,et al.  Breaking Address Mapping Symmetry at Multi-levels of Memory Heirarchy to Reduce DRAM Row-buffer Conflicts , 2001, J. Instr. Level Parallelism.

[10]  Onur Mutlu,et al.  Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[11]  Calvin Lin,et al.  Adaptive History-Based Memory Schedulers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[12]  Richard McDougall,et al.  Solaris Internals: Solaris 10 and OpenSolaris Kernel Architecture , 2006 .

[13]  Anoop Gupta,et al.  Design and evaluation of a compiler algorithm for prefetching , 1992, ASPLOS V.

[14]  Richard E. Kessler,et al.  Page placement algorithms for large real-indexed caches , 1992, TOCS.

[15]  James E. Smith,et al.  Fair Queuing Memory Systems , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[16]  Hironori Kasahara,et al.  Cache Optimization for Coarse Grain Task Parallel Processing Using Inter-Array Padding , 2003, LCPC.

[17]  Jacqueline Chame,et al.  A Compiler Algorithm for Exploiting Page-Mode Memory Access in Embedded-DRAM Devices , .

[18]  Todd C. Mowry,et al.  Compiler-directed page coloring for multiprocessors , 1996, ASPLOS VII.

[19]  Sally A. McKee,et al.  Dynamic Access Ordering for Streamed Computations , 2000, IEEE Trans. Computers.

[20]  David W. Binkley,et al.  Interprocedural slicing using dependence graphs , 1988, SIGP.

[21]  Aamer Jaleel,et al.  DRAMsim: a memory system simulator , 2005, CARN.

[22]  Scott Rixner,et al.  Memory Controller Optimizations for Web Servers , 2004, 37th International Symposium on Microarchitecture (MICRO-37'04).

[23]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[24]  Jun Shao,et al.  A Burst Scheduling Access Reordering Mechanism , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[25]  Ken Kennedy,et al.  Improving effective bandwidth through compiler enhancement of global cache reuse , 2001, Proceedings 15th International Parallel and Distributed Processing Symposium. IPDPS 2001.

[26]  Zhao Zhang,et al.  Memory Access Scheduling Schemes for Systems with Multi-Core Processors , 2008, 2008 37th International Conference on Parallel Processing.

[27]  Onur Mutlu,et al.  Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems , 2008, 2008 International Symposium on Computer Architecture.

[28]  Yutao Zhong,et al.  Predicting whole-program locality through reuse distance analysis , 2003, PLDI.

[29]  Zhao Zhang,et al.  Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[30]  Tao Li,et al.  Informed Microarchitecture Design Space Exploration Using Workload Dynamics , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).