Parity Lost and Parity Regained

RAID storage systems protect data from storage errors, such as data corruption, using a set of one or more integrity techniques, such as checksums. The exact protection offered by a given technique, or by a combination of techniques, is sometimes unclear. We introduce and apply a formal method of analyzing the design of data protection strategies. Specifically, we use model checking to evaluate whether common protection techniques used in parity-based RAID systems are sufficient in light of the increasingly complex failure modes of modern disk drives. We evaluate the approaches taken by a number of real systems under single-error conditions, and find flaws in every scheme. In particular, we identify a parity pollution problem that spreads corrupt data (the result of a single error) across multiple disks, thus leading to data loss or corruption. We further identify which protection measures must be used to avoid such problems. Finally, we show how to combine real-world failure data with the results from the model checker to estimate the actual likelihood of data loss under different protection strategies.
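
To make the parity pollution problem concrete, the following is a minimal sketch, not the model used in the paper. It assumes a toy single-parity stripe of three data blocks, a silent corruption of one block, and a naive scrub that recomputes parity from whatever is on disk without verifying it. The helper names (`xor_parity`, `reconstruct`, `demo_parity_pollution`) are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the bytewise XOR of a list of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Rebuild one missing block from the other data blocks plus parity."""
    return xor_parity(surviving_blocks + [parity])


def demo_parity_pollution():
    # Toy stripe: three data blocks and one parity block (assumed layout).
    data = [b"AAAA", b"BBBB", b"CCCC"]
    original_block0 = data[0]
    parity = xor_parity(data)

    # Single error: block 0 is silently corrupted on disk.
    data[0] = b"AXAA"

    # While parity is still clean, block 0 can be rebuilt from its peers.
    recoverable_before = reconstruct([data[1], data[2]], parity) == original_block0

    # Naive scrub (assumption): parity is recomputed from the on-disk data,
    # so the corruption is folded into the parity block as well.
    parity = xor_parity(data)

    # Parity now agrees with the corrupt block; the good data is gone.
    recoverable_after = reconstruct([data[1], data[2]], parity) == original_block0

    return recoverable_before, recoverable_after


if __name__ == "__main__":
    print(demo_parity_pollution())
```

In this sketch the single corrupt block is recoverable until parity is recomputed from it; afterwards both the data block and the parity block encode the corruption, which is one way a single error can end in data loss.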
