A Review of Disc Scrubbing and Intra Disc Redundancy Techniques for Reducing Data Loss in Disc File Systems

Because of the high demand that today's applications and new technologies place on data storage capacity, more disk drives are needed, which increases the probability of inaccessible sectors, referred to as Latent Sector Errors (LSEs). To reduce data loss caused by LSEs, two main techniques have been studied extensively in recent years: Disk Scrubbing, which performs read operations during idle periods to search for errors, and Intra-Disk Redundancy (IDR), which is based on redundancy codes. This paper reviews the problem of LSEs, their main causes, their properties, and their correlation on nearline and enterprise disks. Focusing on reducing LSEs with regard to security, processing overhead, and disk space, we analyze and compare the latest Disk Scrubbing and Intra-Disk Redundancy techniques to highlight the issues and challenges under different statistical approaches. Furthermore, based on previous evaluation results, we discuss the benefits of using both schemes simultaneously: combining different IDR coding schemes with Accelerated Scrubbing and Staggered Scrubbing, applied during idle periods to the particular regions of disk drives that store crucial data. Finally, drawing on an extended statistical analysis, we discuss and evaluate the best ways to reduce data loss with minimal impact on system performance.
