A Rising Tide Lifts All Boats: How Memory Error Prediction and Prevention Can Help with Virtualized System Longevity

Memory is the most frequently failing component that can cause a system crash, which significantly affects emerging data centers that are based on system virtualization (e.g., clouds). Such environments differ from previously studied large systems and thus pose renewed challenges to the reliability, availability, and serviceability (RAS) of today's production sites, which host large populations of commodity servers. This paper advocates addressing the problem by exploiting memory error characteristics and employing a cost-effective self-healing mechanism. Specifically, we propose a memory error prediction and prevention model that takes error events and system utilization as input, assesses memory error risk, and manipulates memory mappings accordingly (by page/DIMM replacement or VM live migration) to avoid potential damage and loss.
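
To make the shape of such a model concrete, the following is a minimal sketch of how error events and utilization might be turned into a preventive action. It is not the paper's algorithm: the `ErrorEvent` record, the `choose_action` function, the action names, and all thresholds (`page_limit`, `dimm_limit`, `busy`) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    """One reported memory error (hypothetical record layout)."""
    dimm: int  # index of the DIMM that reported the error
    page: int  # physical page frame number that was affected


def choose_action(events, dimm, utilization,
                  page_limit=2, dimm_limit=4, busy=0.8):
    """Pick a preventive action for one DIMM.

    All thresholds are illustrative assumptions, not values from the paper.
    `utilization` is the host's load as a fraction between 0 and 1.
    """
    on_dimm = [e for e in events if e.dimm == dimm]
    if not on_dimm:
        return "none"

    per_page = Counter(e.page for e in on_dimm)

    # Errors spread over many distinct pages are treated here as a sign
    # that the whole DIMM is at risk, rather than a single weak cell.
    if len(per_page) >= dimm_limit:
        # Assumed policy: on a busy host, migrate VMs away first;
        # on an idle host, replace the DIMM directly.
        return "migrate_vms" if utilization >= busy else "replace_dimm"

    # Repeated errors on the same page: replace only those pages.
    if any(count >= page_limit for count in per_page.values()):
        return "replace_pages"

    return "monitor"
```

For example, two errors on the same page of DIMM 0 yield `"replace_pages"` under these assumed thresholds, while errors on four distinct pages yield `"migrate_vms"` on a heavily loaded host and `"replace_dimm"` on a lightly loaded one.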
