Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?

Component failure in large-scale IT installations is becoming an ever-larger problem as the number of components in a single cluster approaches a million. This article is an extension of our previous study on disk failures [Schroeder and Gibson 2007] and presents and analyzes field-gathered disk replacement data from a number of large production systems, including high-performance computing sites and internet services sites. More than 110,000 disks are covered by this data, some for an entire lifetime of five years. The data includes drives with SCSI and FC, as well as SATA interfaces. The mean time-to-failure (MTTF) of those drives, as specified in their datasheets, ranges from 1,000,000 to 1,500,000 hours, suggesting a nominal annual failure rate of at most 0.88%. We find that in the field, annual disk replacement rates typically exceed 1%, with 2--4% common and up to 13% observed on some systems. This suggests that field replacement is a fairly different process than one might predict based on datasheet MTTF. We also find evidence, based on records of disk replacements in the field, that failure rate is not constant with age, and that rather than a significant infant mortality effect, we see a significant early onset of wear-out degradation. In other words, the replacement rates in our data grew constantly with age, an effect often assumed not to set in until after a nominal lifetime of 5 years. Interestingly, we observe little difference in replacement rates between SCSI, FC, and SATA drives, potentially an indication that disk-independent factors such as operating conditions affect replacement rates more than component-specific ones. On the other hand, we see only one instance of a customer rejecting an entire population of disks as a bad batch, in this case because of media error rates, and this instance involved SATA disks. Time between replacement, a proxy for time between failure, is not well modeled by an exponential distribution and exhibits significant levels of correlation, including autocorrelation and long-range dependence.

[1]  Sheldon M. Ross,et al.  Introduction to Probability Models, Eighth Edition , 1972 .

[2]  Sheldon M. Ross,et al.  Introduction to probability models , 1975 .

[3]  Jim Gray,et al.  Why Do Computers Stop and What Can Be Done About It? , 1986, Symposium on Reliability in Distributed Software and Database Systems.

[4]  Ravishankar K. Iyer,et al.  Measurement and modeling of computer reliability as affected by system activity , 1986, TOCS.

[5]  Randy H. Katz,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.

[6]  Lu Wei,et al.  Analysis of workload influence on dependability , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[7]  Daniel P. Siewiorek,et al.  Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[8]  Jim Gray,et al.  A census of Tandem system availability between 1985 and 1990 , 1990 .

[9]  Ravishankar K. Iyer,et al.  Failure analysis and modeling of a VAXcluster system , 1990, [1990] Digest of Papers. Fault-Tolerant Computing: 20th International Symposium.

[10]  Garth A. Gibson Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis , 1990 .

[11]  Walter Willinger,et al.  On the Self-Similar Nature of Ethernet Traffic ( extended version ) , 1995 .

[12]  Walter Willinger,et al.  On the self-similar nature of Ethernet traffic , 1993, SIGCOMM '93.

[13]  Brendan Murphy,et al.  Measuring system and software reliability using an automated data collection process , 1995 .

[14]  Ravishankar K. Iyer,et al.  Networked Windows NT system field failure data analysis , 1999, Proceedings 1999 Pacific Rim International Symposium on Dependable Computing.

[15]  Feng-Bin Sun,et al.  A comprehensive review of hard-disk drive reliability , 1999, Annual Reliability and Maintainability. Symposium. 1999 Proceedings (Cat. No.99CH36283).

[16]  David A. Patterson,et al.  An Analysis of Error Behaviour in a Large Storage System , 1999 .

[17]  Ravishankar K. Iyer,et al.  Failure data analysis of a LAN of Windows NT based computers , 1999, Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems.

[18]  J. G. Elerath Specifying reliability in the disk drive industry: No more MTBF's , 2000, Annual Reliability and Maintainability Symposium. 2000 Proceedings. International Symposium on Product Quality and Integrity (Cat. No.00CH37055).

[19]  J. G. Elerath AFR: problems of definition, calculation and measurement in a commercial environment , 2000, Annual Reliability and Maintainability Symposium. 2000 Proceedings. International Symposium on Product Quality and Integrity (Cat. No.00CH37055).

[20]  Thomas Karagiannis SELFIS : A Short Tutorial , 2002 .

[21]  Richard P. Martin,et al.  Improving cluster availability using workstation validation , 2002, SIGMETRICS '02.

[22]  Howard Gobioff,et al.  The Google file system , 2003, SOSP '03.

[23]  Archana Ganapathi,et al.  Why Do Internet Services Fail, and What Can Be Done About It? , 2002, USENIX Symposium on Internet Technologies and Systems.

[24]  Graeme R. Cole Estimating Drive Reliability in Desktop Computers and Consumer Electronics , 2003 .

[25]  GhemawatSanjay,et al.  The Google file system , 2003 .

[26]  Peter F. Corbett,et al.  Awarded Best Paper! -- Row-Diagonal Parity for Double Disk Failure Correction , 2004 .

[27]  S. Shah,et al.  Server class disk drives: how reliable are they? , 2004, Annual Symposium Reliability and Maintainability, 2004 - RAMS.

[28]  Mark S. Squillante,et al.  Failure data analysis of a large-scale heterogeneous server environment , 2004, International Conference on Dependable Systems and Networks, 2004.

[29]  Michalis Faloutsos,et al.  Long-range dependence ten years of Internet traffic modeling , 2004, IEEE Internet Computing.

[30]  Andrea C. Arpaci-Dusseau,et al.  IRON file systems , 2005, SOSP '05.

[31]  Richard Wolski,et al.  Modeling Machine Availability in Enterprise and Wide-Area Distributed Computing Environments , 2005, Euro-Par.

[32]  Bianca Schroeder,et al.  Disk Failures in the Real World: What Does an MTTF of 1, 000, 000 Hours Mean to You? , 2007, FAST.

[33]  Shankar Pasupathy,et al.  An analysis of latent sector errors in disk drives , 2007, SIGMETRICS '07.

[34]  Eduardo Pinheiro,et al.  Failure Trends in a Large Disk Drive Population , 2007, FAST.

[35]  Jim Gray,et al.  Empirical Measurements of Disk Failure Rates and Error Rates , 2007, ArXiv.

[36]  Bianca Schroeder,et al.  A Large-Scale Study of Failures in High-Performance Computing Systems , 2006, IEEE Transactions on Dependable and Secure Computing.