Mitigating DRAM complexities through coordinated scheduling policies

Acknowledgments

I wish to thank the multitudes of people who helped me in the lengthy and difficult process of completing a Ph.D. while maintaining my full-time position at IBM. That said, working on the industry-leading IBM POWER architecture team over this time provided me with many real problems to solve, which was very much an advantage. This advantage was reciprocated in my work, in that the combined research community provided innumerable solutions.

I also thank my advisor Lizy John and our research group LCA. Lizy tolerated the periods of low research productivity that corresponded with high demands at IBM. She also motivated me to continue when the end seemed hard to reach. LCA also provided an excellent ecosystem in which to conduct my work. Specifically, LCA member Dimitris Kaseridis was invaluable both for his simulation infrastructure know-how and for his expertise in the field.

Members of the IBM team were also a key component in completing this work. The members of my development organization who contributed to the work include, though the full list is too long to name, William Starke, Steve Dodson, Warren Maule, and Balaram Sinharoy. IBM Research also proved a tremendous resource. Hillery Hunter's command of memory design and keen insights were exceptionally helpful. In addition … various stages.

Most of all I would like to thank my family and friends, who most certainly missed my company as I was spending all my free time down at the ACES building.

Contemporary DRAM systems have maintained impressive scaling by managing a careful balance between performance, power, and storage density. In achieving these goals, a significant sacrifice has been made in DRAM's operational complexity. To realize good performance, systems must properly manage the significant number of structural and timing restrictions of the DRAM devices.
Efficient use of DRAM is further complicated in many-core systems, where the memory interface must be shared among multiple cores and threads competing for memory bandwidth. In computer architecture, caches have primarily been viewed as a means to hide memory latency from the CPU. Cache policies have focused on anticipating the CPU's data needs and are mostly oblivious to the main memory. This work demonstrates that the era of many-core architectures has created new main memory bottlenecks and mandates a new approach: coordination of cache policy with main memory characteristics. Using the cache for memory optimization purposes dramatically expands the memory controller's visibility of processor behavior, at …
