Enhancing Data Availability through Background Activities

Latent sector errors in disk drives affect only a few data sectors and often go undetected until the affected data is accessed again. They may cause data loss if the storage system is operating under reduced redundancy because of previous failures. In this paper, we evaluate the effectiveness of two known techniques for detecting and/or recovering from latent sector errors, namely scrubbing and intra-disk data redundancy. Both techniques are treated as background activities that complete without affecting the otherwise normal operation of the storage system. We focus on how disk idle times can be managed to complete these background tasks effectively without degrading foreground task performance, while reducing the window of vulnerability for data loss. We show via detailed trace-driven simulations that scheduling policies for background jobs that are based on careful monitoring of the stochastic characteristics of disk drive idle times have a minimal effect on foreground task performance while dramatically improving storage system reliability.
