MemGuard: A low cost and energy efficient design to support and enhance memory system reliability

Memory system reliability is an increasing concern as memory cell density and capacity continue to grow. The conventional approach uses redundant memory bits for error detection and correction, which incurs significant storage, cost, and power overheads. In this paper, we propose a novel, system-level scheme called MemGuard for memory error detection. Combined with OS-based checkpointing, it can also recover program execution from memory errors. The memory error detection of MemGuard is motivated by memory integrity verification using log hashes. It is much stronger than SECDED in error detection, incurs negligible hardware cost and energy overhead and no storage overhead, and is compatible with various memory organizations. It may play the role of ECC memory in consumer-level computers and mobile devices, without the shortcomings of ECC memory. In server computers, it may complement SECDED ECC or Chipkill Correct by providing even stronger error detection. We have comprehensively investigated and evaluated the feasibility and reliability of MemGuard. We show that, using an incremental multiset hash function and a non-cryptographic hash function, the performance and energy overheads of MemGuard are negligible. We use mathematical deduction and synthetic simulation to show that MemGuard is robust and reliable.
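The log-hash idea mentioned in the abstract can be illustrated with a minimal sketch. It is not MemGuard's actual design: the class `GuardedMemory`, the functions `mix64` and `block_hash`, the 64-bit additive multiset hash, and the cache-free memory model are all assumptions made for illustration. The sketch keeps two running sums, one over every (address, value) pair written to memory and one over every pair read back; if memory returned exactly what was written, the two sums agree once every live cell has been read.

```python
MASK64 = (1 << 64) - 1


def mix64(x: int) -> int:
    """SplitMix64-style finalizer, used here as a stand-in non-cryptographic hash."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)


def block_hash(addr: int, value: int) -> int:
    """Hash one (address, value) pair to 64 bits (hypothetical helper)."""
    return mix64(mix64(addr) ^ (value & MASK64))


class GuardedMemory:
    """Toy memory with a read log hash and a write log hash (illustrative only)."""

    def __init__(self) -> None:
        self.cells = {}      # address -> stored value
        self.read_hash = 0   # additive multiset hash of everything read
        self.write_hash = 0  # additive multiset hash of everything written

    def _log_read(self, addr: int, value: int) -> None:
        self.read_hash = (self.read_hash + block_hash(addr, value)) & MASK64

    def _log_write(self, addr: int, value: int) -> None:
        self.write_hash = (self.write_hash + block_hash(addr, value)) & MASK64

    def write(self, addr: int, value: int) -> None:
        # Overwriting consumes the old contents, so log them as read first.
        if addr in self.cells:
            self._log_read(addr, self.cells[addr])
        self.cells[addr] = value
        self._log_write(addr, value)

    def read(self, addr: int) -> int:
        # A read consumes the stored value; the same value is then logged
        # again as written, as if it were placed back into memory.
        value = self.cells[addr]
        self._log_read(addr, value)
        self._log_write(addr, value)
        return value

    def check(self) -> bool:
        # Fold every live cell into a copy of the read hash and compare.
        total = self.read_hash
        for addr, value in self.cells.items():
            total = (total + block_hash(addr, value)) & MASK64
        return total == self.write_hash
```

In this sketch, `check()` returns `True` after any sequence of `write` and `read` calls, and returns `False` if a stored value is altered behind the interface (for example, by flipping a bit in `cells` directly), because the corrupted value's hash no longer matches the one logged at write time. Addition modulo 2^64 makes the hash incremental and order-independent, which is the property the multiset hash is used for here.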
