The dynamic granularity memory system

Chip multiprocessors enable continued performance scaling with increasingly many cores per chip. As the throughput of computation outpaces available memory bandwidth, however, the system bottleneck will shift to main memory. We present a memory system, the dynamic granularity memory system (DGMS), which avoids unnecessary data transfers, saves power, and improves system performance by dynamically changing between fine and coarse-grained memory accesses. DGMS predicts memory access granularities dynamically in hardware, and does not require software or OS support. The dynamic operation of DGMS gives it superior ease of implementation and power efficiency relative to prior multi-granularity memory systems, while maintaining comparable levels of system performance.

[1]  John S. Liptay,et al.  Structural Aspects of the System/360 Model 85 II: The Cache , 1968, IBM Syst. J..

[2]  Randy H. Katz,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.

[3]  A. Seznec,et al.  Decoupled sectored caches: conciliating low tag implementation cost and low miss ratio , 1994, Proceedings of 21 International Symposium on Computer Architecture.

[4]  André Seznec,et al.  Decoupled sectored caches: conciliating low tag implementation cost , 1994, ISCA '94.

[5]  Anne Rogers,et al.  Software caching and computation migration in Olden , 1995, PPOPP '95.

[6]  Anoop Gupta,et al.  The SPLASH-2 programs: characterization and methodological considerations , 1995, ISCA.

[7]  M. Valero,et al.  A data cache with multiple caching strategies tuned to different types of locality , 1995, ICS '95.

[8]  Graham A. Jullien,et al.  An efficient tree architecture for modulo 2n+1 multiplication , 1996, J. VLSI Signal Process..

[9]  Timothy J. Dell,et al.  A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .

[10]  Sanjeev Kumar,et al.  Exploiting spatial locality in data caches using spatial footprints , 1998, ISCA.

[11]  Jeffrey B. Rothman,et al.  The pool of subsectors cache design , 1999, ICS '99.

[12]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[13]  William J. Dally,et al.  Memory access scheduling , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[14]  Jaehyuk Huh,et al.  Exploring the design space of future CMPs , 2001, Proceedings 2001 International Conference on Parallel Architectures and Compilation Techniques.

[15]  Babak Falsafi,et al.  Accurate and complexity-effective spatial pattern prediction , 2004, 10th International Symposium on High Performance Computer Architecture (HPCA'04).

[16]  Greg Hamerly,et al.  SimPoint 3.0: Faster and More Flexible Program Analysis , 2005 .

[17]  Brad Calder,et al.  SimPoint 3.0: Faster and More Flexible Program Phase Analysis , 2005, J. Instr. Level Parallelism.

[18]  Thomas F. Wenisch,et al.  Spatial Memory Streaming , 2006, 33rd International Symposium on Computer Architecture (ISCA'06).

[19]  Frederick A. Ware,et al.  Improving Power and Data Efficiency with Threaded Memory Modules , 2006, 2006 International Conference on Computer Design.

[20]  Yale N. Patt,et al.  Line Distillation: Increasing Cache Capacity by Filtering Unused Words in Cache Lines , 2007, 2007 IEEE 13th International Symposium on High Performance Computer Architecture.

[21]  Steve Scott,et al.  The Cray BlackWidow: a highly scalable vector multiprocessor , 2007, Proceedings of the 2007 ACM/IEEE Conference on Supercomputing (SC '07).

[22]  Wolf-Dietrich Weber,et al.  Power provisioning for a warehouse-sized computer , 2007, ISCA '07.

[23]  Peter M. Kogge,et al.  On the Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications , 2007, IEEE Transactions on Computers.

[24]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[25]  Stijn Eyerman,et al.  System-Level Performance Metrics for Multiprogram Workloads , 2008, IEEE Micro.

[26]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[27]  Jung Ho Ahn,et al.  Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs , 2009, IEEE Computer Architecture Letters.

[28]  Christoforos E. Kozyrakis,et al.  Future scaling of processor-memory interfaces , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[29]  Norman P. Jouppi,et al.  CACTI 6.0: A Tool to Model Large Caches , 2009 .

[30]  Jung Ho Ahn,et al.  McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures , 2009, 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO).

[31]  Gabriel H. Loh,et al.  Zesto: A cycle-level simulator for highly detailed microarchitecture exploration , 2009, 2009 IEEE International Symposium on Performance Analysis of Systems and Software.

[32]  Thomas F. Wenisch,et al.  Disaggregated memory for expansion and sharing in blade servers , 2009, ISCA '09.

[33]  Doe Hyun Yoon,et al.  Virtualized and flexible ECC for main memory , 2010, ASPLOS XV.

[34]  Tony M. Brewer,et al.  Instruction Set Innovations for the Convey HC-1 Computer , 2010, IEEE Micro.

[35]  Doe Hyun Yoon,et al.  Adaptive granularity memory systems: A tradeoff between storage efficiency and throughput , 2011, 2011 38th Annual International Symposium on Computer Architecture (ISCA).

[36]  Rajesh K. Gupta,et al.  Onyx: A Prototype Phase Change Memory Storage Array , 2011, HotStorage.