An empirical study of memory hardware errors in a server farm

The integrity of system hardware is a prerequisite for providing dependable services. Understanding the hardware's failure mechanisms and error rates is therefore an important step toward devising an effective overall protection mechanism against service failure. In this paper we discuss an ongoing case study of memory hardware failures in production systems in a server-farm environment. We present preliminary results collected from 212 machines. Our observations under normal, non-accelerated conditions validate the existence of all failure modes modeled in the previous literature: single-cell, row, column, and whole-chip failures. We also provide a quantitative analysis of the error rates.
