Frugal ECC: efficient and versatile memory error protection through fine-grained compression

Because main memory is vulnerable to errors and failures, large-scale systems and critical servers use error checking and correcting (ECC) mechanisms to meet their reliability requirements. We propose a novel mechanism, Frugal ECC (FECC), that combines ECC with fine-grained compression to provide versatile protection that can be both stronger and lower in overhead than current schemes, without sacrificing performance. FECC compresses main memory at cache-block granularity and uses any leftover space to store ECC information. Compressed data and its ECC information can then frequently be read with a single access, even without redundant memory chips; insufficiently compressed blocks require additional storage and accesses. As examples, we present chipkill-correct ECCs on a non-ECC DIMM with x4 chips and the first true chipkill-correct ECC for x8 devices using an ECC DIMM. FECC relies on a new Coverage-oriented-Compression scheme that we developed specifically for the modest compression needs of ECC and for floating-point data.
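
The fit-or-overflow decision described above can be illustrated with a minimal sketch. It is not the FECC implementation: the 64-byte block size and the 8-byte ECC budget are assumed values, `pack_block` is a hypothetical helper, and `zlib` stands in for Coverage-oriented-Compression purely to produce a compressed size.

```python
import zlib

BLOCK_SIZE = 64  # bytes per cache block (assumed for illustration)
ECC_SIZE = 8     # bytes of ECC information per block (assumed for illustration)


def pack_block(data: bytes) -> dict:
    """Decide whether a compressed block and its ECC fit in one block-sized slot.

    zlib is only a stand-in compressor here; it is NOT the paper's
    Coverage-oriented-Compression scheme.
    """
    if len(data) != BLOCK_SIZE:
        raise ValueError("expected exactly one cache block")
    compressed = zlib.compress(data, 9)
    # The ECC can be stored inline only if the compressed data leaves room for it.
    fits = len(compressed) + ECC_SIZE <= BLOCK_SIZE
    return {
        "compressed_bytes": len(compressed),
        "inline_ecc": fits,
        # One access when data and ECC share the slot; otherwise the overflow
        # must be fetched from additional storage with a second access.
        "accesses": 1 if fits else 2,
    }


if __name__ == "__main__":
    print(pack_block(bytes(BLOCK_SIZE)))         # highly compressible block
    print(pack_block(bytes(range(BLOCK_SIZE))))  # poorly compressible block
```

In this sketch a block that leaves at least `ECC_SIZE` bytes free is served with one access, while any other block falls back to the overflow path with its extra storage and extra access.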
