Determining Fault Tolerance of XOR-Based Erasure Codes Efficiently

We propose a new fault tolerance metric for XOR-based erasure codes: the minimal erasures list (MEL). A minimal erasure is a set of erasures that leads to irrecoverable data loss and in which every erasure is necessary and sufficient for this to be so. The MEL is the enumeration of all minimal erasures. An XOR-based erasure code has an irregular structure that may permit it to tolerate faults at and beyond its Hamming distance. The MEL completely describes the fault tolerance of an XOR-based erasure code at and beyond its Hamming distance; it is therefore a useful metric for comparing the fault tolerance of such codes. We also propose an algorithm that efficiently determines the MEL of an erasure code. This algorithm uses the structure of the erasure code to efficiently determine the MEL. We show that, in practice, the number of minimal erasures for a given code is much less than the total number of sets of erasures that lead to data loss: in our empirical results for one corpus of codes, there were over 80 times fewer minimal erasures. We use the proposed algorithm to identify the most fault tolerant XOR-based erasure code for all possible systematic erasure codes with up to seven data symbols and up to seven parity symbols.

[1]  James S. Plank,et al.  A practical analysis of low-density parity-check erasure codes for wide-area storage applications , 2004, International Conference on Dependable Systems and Networks, 2004.

[2]  Randy H. Katz,et al.  A case for redundant arrays of inexpensive disks (RAID) , 1988, SIGMOD '88.

[3]  James Lee Hafner,et al.  WEAVER codes: highly fault tolerant erasure codes for storage systems , 2005, FAST'05.

[4]  Jehoshua Bruck,et al.  EVENODD: An Efficient Scheme for Tolerating Double Disk Failures in RAID Architectures , 1995, IEEE Trans. Computers.

[5]  Arif Merchant,et al.  FAB: building distributed enterprise disk arrays from commodity components , 2004, ASPLOS XI.

[6]  J. Plank Optimizing Cauchy Reed-Solomon Codes for Fault-Tolerant Storage Applications , 2005 .

[7]  James Lee Hafner,et al.  Matrix methods for lost data reconstruction in erasure codes , 2005, FAST'05.

[8]  Daniel A. Spielman,et al.  Practical loss-resilient codes , 1997, STOC '97.

[9]  James S. Plank,et al.  Small parity-check erasure codes - exploration and observations , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[10]  Alexander Vardy,et al.  On the stopping distance and the stopping redundancy of codes , 2006, IEEE Transactions on Information Theory.

[11]  Peter F. Corbett,et al.  Row-Diagonal Parity for Double Disk Failure Correction (Awarded Best Paper!) , 2004, USENIX Conference on File and Storage Technologies.

[12]  The 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2007, 25-28 June 2007, Edinburgh, UK, Proceedings , 2007, DSN.

[13]  Robert G. Gallager,et al.  Low-density parity-check codes , 1962, IRE Trans. Inf. Theory.

[14]  James Lee Hafner,et al.  HoVer Erasure Codes For Disk Arrays , 2006, International Conference on Dependable Systems and Networks (DSN'06).