A Clean-Slate Look at Disk Scrubbing

A number of techniques have been proposed to reduce the risk of data loss in hard drives, from redundant disks (e.g., RAID systems) to error coding within individual drives. Disk scrubbing is a background process that reads disks during idle periods to detect irremediable read errors in infrequently accessed sectors. Timely detection of such latent sector errors (LSEs) is important to reduce data loss. In this paper, we take a clean-slate look at disk scrubbing. We present the first formal definition in the literature of a scrubbing algorithm, and translate recent empirical results on LSE distributions into new scrubbing principles. We introduce a new simulation model for LSE incidence in disks that allows us to optimize our proposed scrubbing techniques and demonstrate the significant benefits of intelligent scrubbing for drive reliability. We show how optimal scrubbing strategies depend on disk characteristics (e.g., the bit error rate, BER), as well as on disk workloads.
