A Large-Scale Study of I/O Workload’s Impact on Disk Failure

In large-scale data centers, disk failure is the norm rather than the exception. Frequent disk failures noticeably degrade the user experience and, in the worst case, make data unavailable. Previous research from both industry and academia has studied the causes of disk failure; however, little is known about the intrinsic relation between failed disks and their I/O workload. In this paper, we collect and investigate about four billion drive hours of I/O traces from over 500,000 disks in Tencent’s data centers. We first explore the key characteristics of I/O workload that influence disk reliability. We then present the impact of these I/O workload features on the lifespan of disks and uncover the root causes. Finally, we introduce a new metric to accurately identify the “dangerous” I/O workload that is extremely harmful to disk health. To the best of our knowledge, this research is the first in-depth analysis of the I/O workload’s impact on disk reliability, and it opens up a new dimension for I/O scheduling policies in data centers.
