DRAM errors in the wild: a large-scale field study

Errors in dynamic random access memory (DRAM) are a common form of hardware failure in modern compute clusters. Failures are costly both in terms of hardware replacement costs and service disruption. While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters. In this paper, we analyze measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years. The collected data covers multiple vendors, DRAM capacities and technologies, and comprises many millions of DIMM days. The goal of this paper is to answer questions such as the following: How common are memory errors in practice? What are their statistical properties? How are they affected by external factors, such as temperature and utilization, and by chip-specific factors, such as chip density, memory technology and DIMM age? We find that DRAM error behavior in the field differs in many key aspects from commonly held assumptions. For example, we observe DRAM error rates that are orders of magnitude higher than previously reported, with 25,000 to 70,000 errors per billion device hours per Mbit and more than 8% of DIMMs affected by errors per year. We provide strong evidence that memory errors are dominated by hard errors rather than soft errors, which previous work suspects to be the dominant error mode. We find that temperature, known to strongly impact DIMM error rates in lab conditions, has a surprisingly small effect on error behavior in the field once all other factors are taken into account. Finally, contrary to common fears, we observe no indication that newer generations of DIMMs have worse error behavior.
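To give a feel for the scale of the reported rate, the following sketch converts "errors per billion device hours per Mbit" into an expected number of errors per DIMM per year. The 1 GiB DIMM capacity, the reading of "device hours" as DIMM hours, and the function name `errors_per_dimm_year` are assumptions made for illustration; they are not taken from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years
BILLION_HOURS = 1e9


def errors_per_dimm_year(rate_per_mbit: float, capacity_mbit: float) -> float:
    """Expected errors per DIMM per year.

    rate_per_mbit: errors per 1e9 device hours per Mbit (as quoted in the abstract).
    capacity_mbit: DIMM capacity in Mbit (an assumed value, see below).
    """
    errors_per_hour = rate_per_mbit * capacity_mbit / BILLION_HOURS
    return errors_per_hour * HOURS_PER_YEAR


# Assumption: a hypothetical 1 GiB DIMM, i.e. 1024 MiB * 8 bits = 8192 Mbit.
capacity_mbit = 1024 * 8

low = errors_per_dimm_year(25_000, capacity_mbit)
high = errors_per_dimm_year(70_000, capacity_mbit)
print(f"{low:.0f} to {high:.0f} errors per DIMM per year (fleet average)")
```

Under these assumptions the result is a fleet-wide average of roughly 1,800 to 5,000 errors per DIMM per year. Because only somewhat more than 8% of DIMMs are affected in a given year, this average is concentrated on a minority of DIMMs and does not describe a typical one.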
