Virtualized and flexible ECC for main memory

We present a general scheme for virtualizing main memory error-correction mechanisms, which map redundant information needed to correct errors into the memory namespace itself. We rely on this basic idea, which increases flexibility to increase error protection capabilities, improve power efficiency, and reduce system cost; with only small performance overheads. We augment the virtual memory system architecture to detach the physical mapping of data from the physical mapping of its associated ECC information. We then use this mechanism to develop two-tiered error protection techniques that separate the process of detecting errors from the rare need to also correct errors, and thus save energy. We describe how to provide strong chipkill and double-chip kill protection using existing DRAM and packaging technology. We show how to maintain access granularity and redundancy overheads, even when using ×8 DRAM chips. We also evaluate error correction for systems that do not use ECC DIMMs. Overall, analysis of demanding SPEC CPU 2006 and PARSEC benchmarks indicates that performance overhead is only 1% with ECC DIMMs and less than 10% using standard Non-ECC DIMM configurations, that DRAM power savings can be as high as 27%, and that the system energy-delay product is improved by 12% on average.

[1]  Chin-Long Chen,et al.  Error-Correcting Codes for Semiconductor Memory Applications: A State-of-the-Art Review , 1984, IBM J. Res. Dev..

[2]  Eduardo Pinheiro,et al.  DRAM errors in the wild: a large-scale field study , 2009, SIGMETRICS '09.

[3]  Maurice Steinman,et al.  ECC implementation in components without ECC , 2008 .

[4]  Koushik Chakraborty,et al.  Mixed-mode multicore reliability , 2009, ASPLOS.

[5]  Fredrik Larsson,et al.  Simics: A Full System Simulation Platform , 2002, Computer.

[6]  M. Ma,et al.  Impact of Error Correction Code and Dynamic Memory Reconfiguration on High-Reliability/Low-Cost Server Memory , 2006, 2006 IEEE International Integrated Reliability Workshop Final Report.

[7]  Jeffrey D. Gilbert,et al.  Over one million TPCC with a 45nm 6-core Xeon® CPU , 2009, 2009 IEEE International Solid-State Circuits Conference - Digest of Technical Papers.

[8]  Daniel J. Sorin,et al.  Choosing an Error Protection Scheme for a Microprocessor's L1 Data Cache , 2006, 2006 International Conference on Computer Design.

[9]  Richard W. Hamming,et al.  Error detecting and error correcting codes , 1950 .

[10]  Christoforos E. Kozyrakis,et al.  Future scaling of processor-memory interfaces , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[11]  Milo M. K. Martin,et al.  Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset , 2005, CARN.

[12]  Zhao Zhang,et al.  A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality , 2000, MICRO 33.

[13]  Trevor N. Mudge,et al.  Understanding and Designing New Server Architectures for Emerging Warehouse-Computing Environments , 2008, 2008 International Symposium on Computer Architecture.

[14]  C. L. Chen Symbol error correcting codes for memory applications , 1996, Proceedings of Annual Symposium on Fault Tolerant Computing.

[15]  Timothy J. Dell,et al.  A white paper on the benefits of chipkill-correct ecc for pc server main memory , 1997 .

[16]  Greg Hamerly,et al.  SimPoint 3.0: Faster and More Flexible Program Analysis , 2005 .

[17]  Aamer Jaleel,et al.  DRAMsim: a memory system simulator , 2005, CARN.

[18]  Krste Asanovic,et al.  Mondrian memory protection , 2002, ASPLOS X.

[19]  Doe Hyun Yoon,et al.  Flexible cache error protection using an ECC FIFO , 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[20]  Timothy Johnson,et al.  An 8-core, 64-thread, 64-bit power efficient sparc soc (niagara2) , 2007, ISPD '07.

[21]  Shu Lin,et al.  Error control coding : fundamentals and applications , 1983 .

[22]  Kai Li,et al.  The PARSEC benchmark suite: Characterization and architectural implications , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[23]  Timothy J. Dell,et al.  System RAS implications of DRAM soft errors , 2008, IBM J. Res. Dev..

[24]  Mark D. Hill,et al.  Surpassing the TLB performance of superpages with less operating system support , 1994, ASPLOS VI.

[25]  Zhao Zhang,et al.  Mini-rank: Adaptive DRAM architecture for improving memory power efficiency , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[26]  Margaret Martonosi,et al.  Wattch: a framework for architectural-level power analysis and optimizations , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[27]  Doe Hyun Yoon,et al.  Memory mapped ECC: low-cost error protection for last level caches , 2009, ISCA '09.

[28]  Bruce Jacob,et al.  Memory Systems: Cache, DRAM, Disk , 2007 .

[29]  M. Y. Hsiao,et al.  A class of optimal minimum odd-weight-column SEC-DED codes , 1970 .

[30]  James E. Smith,et al.  Implementing high availability memory with a duplication cache , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[31]  G. Tyson,et al.  Eager writeback-a technique for improving bandwidth utilization , 2000, Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000.

[32]  Harish Patil,et al.  Pin: building customized program analysis tools with dynamic instrumentation , 2005, PLDI '05.

[33]  Abraham Silberschatz,et al.  Operating System Concepts , 1983 .

[34]  Jung Ho Ahn,et al.  Multicore DIMM: an Energy Efficient Memory Module with Independently Controlled DRAMs , 2009, IEEE Computer Architecture Letters.