Enhancing data availability in disk drives through background activities

Latent sector errors in disk drives affect only a few data sectors. They occur silently and are detected only when the affected area is accessed again. If a latent error is detected while the storage system is operating under reduced redundancy, i.e., during a RAID rebuild, data loss may occur. Features such as scrubbing and intra-disk data redundancy have been proposed to detect and/or recover from latent errors and so avoid data loss. While such features enhance data availability in the storage system, their execution may degrade performance. In this paper, we evaluate the effectiveness of scrubbing and intra-disk data redundancy in improving data availability, with the overall goal of keeping user performance within predefined bounds. We show that, when treated as low-priority background activities and scheduled efficiently during idle times, these features remain transparent to the storage system user in terms of performance while still improving data reliability. Detailed trace-driven simulations show that the mean time to data loss (MTTDL) improves by up to 5 orders of magnitude when these features are implemented independently. When scrubbing and intra-disk parity updates are both scheduled concurrently during idle times in disk drives, MTTDL improves by as much as 8 orders of magnitude.
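The idea of running scrubbing only during idle times can be illustrated with a minimal sketch. It is not the scheduling policy evaluated in the paper: the idle-wait threshold, the fixed non-preemptive chunk size, and the names `simulate_idle_scrubbing`, `idle_wait_ms`, and `chunk_ms` are all assumptions made here for illustration. The sketch counts how many scrub chunks fit into a sequence of idle intervals and how much delay the last chunk of each interval imposes on the next user request.

```python
import math


def simulate_idle_scrubbing(idle_intervals_ms, idle_wait_ms, chunk_ms):
    """Hypothetical model of background scrubbing in idle intervals.

    The drive waits idle_wait_ms after becoming idle, then issues scrub
    chunks of chunk_ms back to back. Chunks are assumed non-preemptive,
    so a chunk in progress when the idle interval ends delays the next
    user request until it finishes.
    Returns (chunks_completed, total_user_delay_ms).
    """
    chunks_done = 0
    total_delay_ms = 0.0
    for idle in idle_intervals_ms:
        usable = idle - idle_wait_ms
        if usable <= 0:
            # Interval ended before the idle wait expired: no scrubbing.
            continue
        # Every chunk started before the interval ends runs to completion.
        started = math.ceil(usable / chunk_ms)
        chunks_done += started
        # Overshoot of the last chunk past the end of the idle interval.
        total_delay_ms += started * chunk_ms - usable
    return chunks_done, total_delay_ms


# Illustrative values only, not taken from any trace.
chunks, delay = simulate_idle_scrubbing([5, 50, 120], idle_wait_ms=20, chunk_ms=40)
print(chunks, delay)
```

Under these assumptions, a longer idle wait trades scrubbing throughput for less delay imposed on user requests, which is the kind of trade-off a bound on user performance constrains.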
