COMeT+: Continuous Online Memory Testing with Multi-Threading Extension

Today's computers have gigabytes of main memory due to improved DRAM density. As density increases, smaller bit cells become more susceptible to errors, and the need for memory resiliency increases with them. Self-testing of memory health can proactively check for errors to improve resiliency. This paper describes a software-only self-test that continuously tests memory. We present the challenges and design of an approach, called Continuous Online Memory Testing with Multi-threading Extension (COMeT+), that targets chip multiprocessors. COMeT+ tests memory health concurrently with the execution of single- and multi-threaded applications, in anticipation of allocation requests. The approach guarantees that memory is tested within a fixed time interval to limit exposure to lurking errors. We developed and evaluated an implementation of COMeT+. On the SPEC CPU2006 and PARSEC benchmarks, COMeT+ has an average performance overhead of 4%. On the PARSEC benchmarks, the effect on application performance of TLB shootdowns due to the additional page migrations caused by COMeT+ was insignificant. When emulated errors were injected into physical memory, applications executed 1.13× to 4.41× longer with COMeT+ than without it.
