RAMpage: Graceful Degradation Management for Memory Errors in Commodity Linux Servers

Memory errors are a major source of reliability problems in current computers. Undetected errors may result in program termination or, even worse, silent data corruption. Recent studies have shown that permanent memory errors occur an order of magnitude more frequently than previously assumed and regularly affect everyday operation. Often, neither additional circuitry for hardware-based error detection nor downtime for hardware tests can be afforded. In the case of permanent memory errors, a system faces two challenges: detecting errors as early as possible and handling them without incurring downtime. To increase system reliability, we have developed RAMpage, an online memory testing infrastructure for commodity x86-64-based Linux servers that efficiently detects memory errors and provides graceful degradation by withdrawing affected memory pages from further use. We describe the design and implementation of RAMpage and present the results of an extensive qualitative and quantitative evaluation.
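
To make the two steps named above concrete (detecting a permanent error, then withdrawing the affected page), the following is a minimal sketch in Python. It is a simulation only and is not RAMpage's implementation: the `SimulatedMemory` class, the stuck-at fault model, the choice of the MATS+ march test, and the 4096-byte page size are all assumptions made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (common x86-64 base page size)


class SimulatedMemory:
    """Hypothetical byte-addressable memory with optional stuck-at faults."""

    def __init__(self, size, stuck_at=None):
        self.cells = bytearray(size)
        # Maps a faulty address to the byte value it is permanently stuck at.
        self.stuck_at = dict(stuck_at or {})

    def write(self, addr, value):
        # A stuck cell ignores the written value.
        self.cells[addr] = self.stuck_at.get(addr, value)

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def mats_plus(mem, start, length):
    """Run a byte-wise MATS+ march test over [start, start + length).

    March elements: write 0x00 everywhere; ascending read 0x00 / write 0xFF;
    descending read 0xFF / write 0x00. Returns the set of failing addresses.
    """
    faulty = set()
    end = start + length
    for addr in range(start, end):
        mem.write(addr, 0x00)
    for addr in range(start, end):
        if mem.read(addr) != 0x00:
            faulty.add(addr)
        mem.write(addr, 0xFF)
    for addr in range(end - 1, start - 1, -1):
        if mem.read(addr) != 0xFF:
            faulty.add(addr)
        mem.write(addr, 0x00)
    return faulty


def pages_to_retire(faulty_addrs):
    """Map failing byte addresses to the page frame numbers containing them."""
    return sorted({addr // PAGE_SIZE for addr in faulty_addrs})


if __name__ == "__main__":
    # Two injected faults: one cell stuck at 0xFF in page 0, one stuck at 0x00 in page 2.
    mem = SimulatedMemory(4 * PAGE_SIZE, stuck_at={5: 0xFF, 2 * PAGE_SIZE + 17: 0x00})
    bad = mats_plus(mem, 0, 4 * PAGE_SIZE)
    print("faulty addresses:", sorted(bad))
    print("pages to withdraw:", pages_to_retire(bad))
```

In this sketch, graceful degradation amounts to excluding the returned page frames from further allocation while the remaining pages stay in use. On a real system the test would have to run on physical page frames and hand the results to the kernel, which this simulation does not attempt.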
