Modeling the Impact of Disk Scrubbing on Storage System

One of the defining characteristics of emerging cloud storage technology is the combination of low cost and high reliability. A distinct benefit of disk scanning, or scrubbing, is that it identifies potentially failing sectors as early as possible, thus improving reliability. The higher the scrubbing frequency, the higher the system reliability. However, a scanning process may take a few hours to check a whole disk, so scrubbing may cause downtime or degrade system performance. Scrubbing also consumes energy. To reduce the impact of disk scrubbing on disk performance and energy consumption, system designers choose to scan disks at a low frequency, which lowers reliability. It is therefore essential to design a good scrubbing scheme for large-scale storage systems operating over long time horizons. In this paper, we present a novel scrubbing scheme that addresses this challenge. In this scheme, an optimum scrubbing cycle is chosen by balancing data loss cost, scrubbing cost, and disk failure rate. Our research shows how the data price and the scrubbing cost affect the scrubbing frequency, and that the scheme is applicable to storage holding inexpensive data. Our experiments show that the scheme outperforms the routine method by 73.3% in cost and by 40% in reliability.
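The trade-off between scrubbing cost and data loss cost can be illustrated with a toy model. The sketch below is not the model used in this paper; it is a minimal illustration under stated assumptions: latent errors arrive at a constant rate `lam` per hour, an error stays undetected for half a scrubbing cycle on average, and each undetected error-hour carries a hypothetical expected loss `c_loss` (a stand-in for data price times exposure risk). The names `cost_rate`, `optimal_cycle`, `c_scrub`, `c_loss`, and `lam` are all illustrative.

```python
import math


def cost_rate(cycle_hours, c_scrub, c_loss, lam):
    """Expected cost per hour for a given scrubbing cycle (toy model).

    c_scrub: cost of one full scrub pass (assumed constant).
    c_loss:  assumed expected loss per undetected error-hour.
    lam:     assumed latent error arrival rate (errors per hour).
    """
    scrubbing = c_scrub / cycle_hours           # one scrub per cycle
    exposure = lam * c_loss * cycle_hours / 2   # mean undetected time is cycle/2
    return scrubbing + exposure


def optimal_cycle(c_scrub, c_loss, lam):
    """Cycle length that minimizes cost_rate (set the derivative to zero)."""
    return math.sqrt(2.0 * c_scrub / (lam * c_loss))


if __name__ == "__main__":
    # Hypothetical parameters chosen only to make the arithmetic easy to follow.
    cheap = optimal_cycle(c_scrub=1.0, c_loss=2.0, lam=0.01)
    pricey = optimal_cycle(c_scrub=1.0, c_loss=8.0, lam=0.01)
    print(f"cheap data:  scrub every {cheap:.1f} h")
    print(f"pricey data: scrub every {pricey:.1f} h")
```

In this toy model, quadrupling the loss cost halves the optimal cycle, which matches the qualitative claim that more valuable data justifies more frequent scrubbing, while a higher scrubbing cost lengthens the cycle.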
